Looking at the Lucene Similarity Javadoc, I see some references to tokens, but I am wondering if that is intentional or whether those should really be references to terms.
For example:

    * <li><b>lengthNorm</b> - computed
    * when the document is added to the index in accordance with the number of tokens
    * of this field in the document, so that shorter fields contribute more to the score.

I think that should be terms, not tokens. See:
https://github.com/apache/lucene-solr/blob/releases/lucene-solr/5.5.0/lucene/core/src/java/org/apache/lucene/search/similarities/TFIDFSimilarity.java#L466

And this:

    * Returns the total number of tokens in the field.
    * @see Terms#getSumTotalTermFreq() */
    public long getNumberOfFieldTokens() { return numberOfFieldTokens;

I think that should be terms as well. See:
https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/similarities/BasicStats.java#L65

And... this:

    numberOfFieldTokens = sumTotalTermFreq;

where it is clearly starting with terms and treating them as tokens. As in the previous example, I think that should be terms as well. See:
https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/similarities/SimilarityBase.java#L128

One last example:

    * Compute any collection-level weight (e.g. IDF, average document length, etc) needed for scoring a query.
    *
    * @param collectionStats collection-level statistics, such as the number of tokens in the collection.
    * @param termStats term-level statistics, such as the document frequency of a term across the collection.
    * @return SimWeight object with the information this Similarity needs to score a query.
    */
    public abstract SimWeight computeWeight(CollectionStatistics collectionStats, TermStatistics... termStats);

See:
https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/similarities/Similarity.java#L161

In fact, CollectionStatistics uses term, not token:

    /** returns the total number of tokens for this field
     * @see Terms#getSumTotalTermFreq() */
    public final long sumTotalTermFreq() { return sumTotalTermFreq;

Oops...
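To make the conflation concrete, here is an illustrative plain-Java sketch (not Lucene code; class and variable names are mine) of why "the number of tokens in the field" and the sum of per-term frequencies are numerically the same quantity, which is presumably how the two words ended up interchangeable in the javadoc:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only, not Lucene's implementation: the sum of per-term
// frequencies over a field equals the total count of term occurrences
// in that field (what the javadoc loosely calls "tokens").
public class SumTotalTermFreqDemo {
    public static void main(String[] args) {
        // Post-analysis term occurrences of one field ("quick" appears twice).
        List<String> fieldTerms = Arrays.asList("quick", "brown", "fox", "quick");

        // Per-term frequency, analogous to what totalTermFreq reports per term.
        Map<String, Long> termFreq = new HashMap<>();
        for (String term : fieldTerms) {
            termFreq.merge(term, 1L, Long::sum);
        }

        // Sum the per-term frequencies...
        long sumTotalTermFreq = termFreq.values().stream().mapToLong(Long::longValue).sum();

        // ...which is necessarily the total number of term occurrences in the field.
        System.out.println(sumTotalTermFreq == fieldTerms.size()); // prints true
    }
}
```

So the statistic itself is well defined either way; my complaint is only that the documentation should pick one word for it.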
it uses both, emphasizing my point about the confusion. There are other examples as well.

My understanding is that tokens are merely a temporary, transitional form between the original raw source text of a field and the final terms to be indexed (or the query terms from a parsed and analyzed query). Yes, during and within a TokenStream or the analyzer we speak of tokens, and intermediate string values are referred to as tokens, but once the final string value is retrieved from the TokenStream (analyzer), it's a term.

In any case, is there some distinction in any of these cited examples (or in other examples in this or related code) where "token" is an important distinction to be made and "term" is not the proper... term... to be used?

Unless the Lucene project fully intends that "token" and "term" be absolutely synonymous, a clear distinction should be drawn... I think. Or at least the two words should be used consistently, which my last example highlights.

Thanks.

-- Jack Krupansky
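The pipeline I have in mind can be sketched in plain Java (this is a conceptual illustration, not Lucene's actual TokenStream/Analyzer API; the tokenizer and filters here are stand-ins):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;

// Conceptual sketch of the analysis chain: "tokens" are the intermediate
// values flowing through the chain; the strings that survive to the end
// and get indexed are "terms".
public class TokenVsTermSketch {
    public static void main(String[] args) {
        String rawFieldText = "The Quick BROWN fox";

        // Tokenizer stage: split the raw source text into tokens.
        List<String> tokens = new ArrayList<>(Arrays.asList(rawFieldText.split("\\s+")));

        // Token filter stages: each consumes and emits *tokens*.
        Set<String> stopwords = Set.of("the", "a", "an");
        List<String> terms = new ArrayList<>();
        for (String token : tokens) {
            String lowered = token.toLowerCase();   // still a token here
            if (stopwords.contains(lowered)) {
                continue;                           // dropped token: never becomes a term
            }
            terms.add(lowered);                     // end of the chain: now a term
        }

        System.out.println(terms); // prints [quick, brown, fox]
    }
}
```

Under that reading, none of the cited javadoc passages are describing anything inside the analysis chain, so "term" would be the right word in all of them.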