I gather that "term" is the proper technical term within the Vector Space Model (TF-IDF) and BM25 similarity, so it may simply be a question of where the boundary lies in Lucene between VSM processing and everything else, such as the source for documents and queries.
-- Jack Krupansky

On Wed, Apr 20, 2016 at 1:51 PM, Ryan Josal <[email protected]> wrote:

> My understanding is a Term is comprised of a "token" and a field. So then
> the documentation makes sense to me - return the count of tokens in a field
> for example. But there were a couple of references you had there that
> don't match with that definition, like the number of tokens in a
> collection. Although maybe a Term doesn't have a whole token because what
> about token attributes like payload. I guess I've convinced myself I'm not
> entirely clear about it either, but I do feel good about the concept that
> tokens don't have fields. You can tokenize a string without thinking about
> fields, and they become terms with fields when you query.
>
> Ryan
>
>
> On Wednesday, April 20, 2016, Jack Krupansky <[email protected]>
> wrote:
>
>> Looking at the Lucene Similarity Javadoc, I see some references to
>> tokens, but I am wondering if that is intentional or whether those should
>> really be references to terms.
>>
>> For example:
>>
>> * <li><b>lengthNorm</b> - computed
>> * when the document is added to the index in accordance with the number of tokens
>> * of this field in the document, so that shorter fields contribute more to the score.
>>
>> I think that should be terms, not tokens.
>>
>> See:
>>
>> https://github.com/apache/lucene-solr/blob/releases/lucene-solr/5.5.0/lucene/core/src/java/org/apache/lucene/search/similarities/TFIDFSimilarity.java#L466
>>
>> And this:
>>
>> * Returns the total number of tokens in the field.
>> * @see Terms#getSumTotalTermFreq()
>> */
>> public long getNumberOfFieldTokens() {
>> return numberOfFieldTokens;
>>
>> I think that should be terms as well:
>>
>> See:
>>
>> https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/similarities/BasicStats.java#L65
>>
>> And... this:
>>
>> numberOfFieldTokens = sumTotalTermFreq;
>>
>> Where it is clearly starting with terms and treating them as tokens, but
>> as in the previous example, I think that should be terms as well.
>>
>> See:
>>
>> https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/similarities/SimilarityBase.java#L128
>>
>> One last example:
>>
>> * Compute any collection-level weight (e.g. IDF, average document length, etc) needed for scoring a query.
>> *
>> * @param collectionStats collection-level statistics, such as the number of tokens in the collection.
>> * @param termStats term-level statistics, such as the document frequency of a term across the collection.
>> * @return SimWeight object with the information this Similarity needs to score a query.
>> */
>> public abstract SimWeight computeWeight(CollectionStatistics collectionStats, TermStatistics... termStats);
>>
>> See:
>>
>> https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/similarities/Similarity.java#L161
>>
>> In fact, CollectionStatistics uses term, not token:
>>
>> /** returns the total number of tokens for this field
>> * @see Terms#getSumTotalTermFreq() */
>> public final long sumTotalTermFreq() {
>> return sumTotalTermFreq;
>>
>> Oops... it uses both, emphasizing my point about the confusion.
>>
>> There are other examples as well.
>>
>> My understanding is that tokens are merely a temporary transition in
>> between the original raw source text for a field and then final terms to be
>> indexed (or query terms from a parsed and analyzed query.) Yes, during and
>> within TokenStream or the analyzer we speak of tokens and intermediate
>> string values are referred to as tokens, but once the final string value is
>> retrieved from the Token Stream (analyzer), it's a term.
>>
>> In any case, is there some distinction in any of these cited examples (or
>> other examples in this or related code) where "token" is an important
>> distinction to be made and "term" is not the proper... term... to be used?
>>
>> Unless the Lucene project fully intends that the terms token and term are
>> absolutely synonymous, a clear distinction should be drawn... I think. Or
>> at least the terms should be used consistently, which my last example
>> highlights.
>>
>> Thanks.
>>
>> -- Jack Krupansky
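To make the distinction in this thread concrete, here is a minimal sketch in plain Java with no Lucene dependency. Everything in it is an assumption for illustration: the class and method names (`TokenVsTermSketch`, `tokenize`, `termFrequencies`, `sumTotalTermFreq`) are hypothetical, the splitter is only a stand-in for a real analyzer, and a "term" is modeled as a field name plus the final token text. It shows that the number of distinct terms in a field differs from the sum of their frequencies, and that the latter equals the number of token occurrences, which may be why the Javadoc for `sumTotalTermFreq` speaks of tokens.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Illustration only: not Lucene's API, just the counting idea discussed above.
class TokenVsTermSketch {

    // Hypothetical stand-in for an analyzer: split on non-alphanumerics, lowercase.
    // Tokens carry no field; they are just the strings produced from the text.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("[^\\p{Alnum}]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // A "term" here is modeled as "field:text"; the value is how often it occurs.
    static Map<String, Integer> termFrequencies(String field, List<String> tokens) {
        Map<String, Integer> freqs = new LinkedHashMap<>();
        for (String token : tokens) {
            freqs.merge(field + ":" + token, 1, Integer::sum);
        }
        return freqs;
    }

    // Sum of the frequencies of all terms in the field.
    static long sumTotalTermFreq(Map<String, Integer> freqs) {
        long sum = 0;
        for (int f : freqs.values()) {
            sum += f;
        }
        return sum;
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize("The quick fox and the lazy dog");
        Map<String, Integer> freqs = termFrequencies("body", tokens);
        System.out.println("tokens emitted:   " + tokens.size());
        System.out.println("distinct terms:   " + freqs.size());
        System.out.println("sumTotalTermFreq: " + sumTotalTermFreq(freqs));
    }
}
```

For the sample sentence this prints 7 tokens, 6 distinct terms, and a frequency sum of 7, because "the" occurs twice. Under this reading, "number of tokens in the field" and "sum of term frequencies" describe the same count, while "number of terms" would be ambiguous between 6 and 7.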
