Re: Term vs. token
I gather that "term" is the proper technical term within the Vector Space Model (TF-IDF) and BM25 similarity, so it may simply be a question of where the boundary lies in Lucene between VSM processing and everything else, such as the source for documents and queries.

-- Jack Krupansky

On Wed, Apr 20, 2016 at 1:51 PM, Ryan Josal wrote:

> My understanding is a Term is comprised of a "token" and a field. So then
> the documentation makes sense to me - return the count of tokens in a field
> for example.
> [...]
Re: Term vs. token
My understanding is a Term is comprised of a "token" and a field. So then the documentation makes sense to me - return the count of tokens in a field, for example. But a couple of the references you had there don't match that definition, like the number of tokens in a collection. Although maybe a Term doesn't hold a whole token, because what about token attributes like the payload?

I guess I've convinced myself I'm not entirely clear about it either, but I do feel good about the concept that tokens don't have fields. You can tokenize a string without thinking about fields, and the tokens become terms with fields when you query.

Ryan

On Wednesday, April 20, 2016, Jack Krupansky wrote:

> Looking at the Lucene Similarity Javadoc, I see some references to tokens,
> but I am wondering if that is intentional or whether those should really be
> references to terms.
> [...]
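To make the "tokens don't have fields" idea concrete, here is a minimal sketch in plain Java. It does not use Lucene: FieldTerm, TokenFieldSketch, tokenize and toTerms are hypothetical names invented for illustration, and the lowercase-and-split analysis is only a stand-in for a real analyzer. It shows that the same token text can be produced without any field and only becomes a distinct term once a field name is attached.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Objects;

/** Hypothetical (field, text) pair for illustration; this is not Lucene's Term class. */
final class FieldTerm {
    final String field;
    final String text;

    FieldTerm(String field, String text) {
        this.field = Objects.requireNonNull(field);
        this.text = Objects.requireNonNull(text);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof FieldTerm)) return false;
        FieldTerm other = (FieldTerm) o;
        return field.equals(other.field) && text.equals(other.text);
    }

    @Override
    public int hashCode() {
        return Objects.hash(field, text);
    }

    @Override
    public String toString() {
        return field + ":" + text;
    }
}

final class TokenFieldSketch {
    /** Toy analysis (an assumption): lowercase, then split on non-alphanumerics. No field is involved. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    /** Pairing each token's text with a field name is what yields terms in this sketch. */
    static List<FieldTerm> toTerms(String field, List<String> tokens) {
        List<FieldTerm> terms = new ArrayList<>();
        for (String token : tokens) {
            terms.add(new FieldTerm(field, token));
        }
        return terms;
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize("Lucene in Action");
        System.out.println(tokens);
        System.out.println(toTerms("title", tokens));
        System.out.println(toTerms("body", tokens));
    }
}
```

In this sketch the "title" and "body" terms built from the same tokens compare as unequal, which is the distinction being drawn above. Token attributes such as payloads are deliberately left out.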
Term vs. token
Looking at the Lucene Similarity Javadoc, I see some references to tokens, but I am wondering if that is intentional or whether those should really be references to terms.

For example:

    * lengthNorm - computed
    * when the document is added to the index in accordance with the number of tokens
    * of this field in the document, so that shorter fields contribute more to the score.

I think that should be terms, not tokens.

See:

https://github.com/apache/lucene-solr/blob/releases/lucene-solr/5.5.0/lucene/core/src/java/org/apache/lucene/search/similarities/TFIDFSimilarity.java#L466

And this:

    /**
     * Returns the total number of tokens in the field.
     * @see Terms#getSumTotalTermFreq()
     */
    public long getNumberOfFieldTokens() {
      return numberOfFieldTokens;
    }

I think that should be terms as well.

See:

https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/similarities/BasicStats.java#L65

And... this:

    numberOfFieldTokens = sumTotalTermFreq;

Where it is clearly starting with terms and treating them as tokens, but as in the previous example, I think that should be terms as well.

See:

https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/similarities/SimilarityBase.java#L128

One last example:

    /**
     * Compute any collection-level weight (e.g. IDF, average document length, etc) needed for scoring a query.
     *
     * @param collectionStats collection-level statistics, such as the number of tokens in the collection.
     * @param termStats term-level statistics, such as the document frequency of a term across the collection.
     * @return SimWeight object with the information this Similarity needs to score a query.
     */
    public abstract SimWeight computeWeight(CollectionStatistics collectionStats, TermStatistics... termStats);

See:

https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/similarities/Similarity.java#L161

In fact, CollectionStatistics uses term, not token:

    /** returns the total number of tokens for this field
     * @see Terms#getSumTotalTermFreq() */
    public final long sumTotalTermFreq() {
      return sumTotalTermFreq;
    }

Oops... it uses both, emphasizing my point about the confusion.

There are other examples as well.

My understanding is that tokens are merely a temporary transition between the original raw source text for a field and the final terms to be indexed (or the query terms from a parsed and analyzed query). Yes, during and within TokenStream or the analyzer we speak of tokens, and intermediate string values are referred to as tokens, but once the final string value is retrieved from the TokenStream (analyzer), it's a term.

In any case, is there some distinction in any of these cited examples (or other examples in this or related code) where "token" is an important distinction to be made and "term" is not the proper... term... to be used?

Unless the Lucene project fully intends that the terms token and term are absolutely synonymous, a clear distinction should be drawn... I think. Or at least the terms should be used consistently, which my last example highlights.

Thanks.

-- Jack Krupansky
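One possible reading of the numberOfFieldTokens = sumTotalTermFreq assignment is that a sum of per-term frequencies counts occurrences rather than distinct terms. The sketch below illustrates only that arithmetic in plain Java. It does not use Lucene: FieldStatsSketch and its methods are hypothetical names, and the toy analysis is an assumption (a real analyzer may drop or add tokens, for example through stop-word or synonym filters).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

final class FieldStatsSketch {
    /** Toy analysis (an assumption): lowercase, then split on non-alphanumerics. */
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    /** Occurrence count of each distinct term over all values of one field. */
    static Map<String, Long> totalTermFreqs(List<String> fieldValues) {
        Map<String, Long> freqs = new TreeMap<>();
        for (String value : fieldValues) {
            for (String token : analyze(value)) {
                freqs.merge(token, 1L, Long::sum);
            }
        }
        return freqs;
    }

    /** Sum of the per-term occurrence counts, analogous in spirit to a "sum of total term frequencies". */
    static long sumTotalTermFreq(Map<String, Long> freqs) {
        long sum = 0;
        for (long freq : freqs.values()) {
            sum += freq;
        }
        return sum;
    }

    /** Number of tokens the toy analysis emits over all values of the field. */
    static long tokenCount(List<String> fieldValues) {
        long count = 0;
        for (String value : fieldValues) {
            count += analyze(value).size();
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> bodies = Arrays.asList("the cat sat", "the cat and the hat");
        Map<String, Long> freqs = totalTermFreqs(bodies);
        System.out.println("distinct terms:   " + freqs.size());
        System.out.println("sumTotalTermFreq: " + sumTotalTermFreq(freqs));
        System.out.println("tokens emitted:   " + tokenCount(bodies));
    }
}
```

In this sketch the sum of term frequencies always equals the number of tokens emitted and generally differs from the number of distinct terms, so "number of tokens" and "sum of term frequencies" describe the same quantity here. Whether that is the intent behind the Javadoc wording is a question for the Lucene developers.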