Term vs. token

Jack Krupansky Wed, 20 Apr 2016 09:05:25 -0700

Looking at the Lucene Similarity Javadoc, I see some references to tokens,
but I am wondering if that is intentional or whether those should really be
references to terms.


For example:

 *        <li><b>lengthNorm</b> - computed
 *        when the document is added to the index in accordance with the number of tokens
 *        of this field in the document, so that shorter fields contribute more to the score.

I think that should be terms, not tokens.

See:
https://github.com/apache/lucene-solr/blob/releases/lucene-solr/5.5.0/lucene/core/src/java/org/apache/lucene/search/similarities/TFIDFSimilarity.java#L466
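To illustrate what "shorter fields contribute more to the score" means in practice, here is a minimal sketch. It assumes the classic 1/sqrt(length) form of the norm and ignores boosts and overlap discounting. The class and method names are hypothetical and this is not the Lucene implementation. Whether "length" counts tokens or terms is exactly the wording question raised here.

```java
// Toy sketch only; LengthNormSketch is a hypothetical name, not a Lucene class.
class LengthNormSketch {

  // Assumed formula: 1 / sqrt(fieldLength), ignoring boosts and overlap discounting.
  static float lengthNorm(int fieldLength) {
    if (fieldLength <= 0) {
      throw new IllegalArgumentException("fieldLength must be positive");
    }
    return (float) (1.0 / Math.sqrt(fieldLength));
  }

  public static void main(String[] args) {
    // Longer fields get a smaller norm, so a match in them counts for less.
    for (int length : new int[] {4, 25, 100}) {
      System.out.println(length + " -> " + lengthNorm(length));
    }
  }
}
```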

And this:

   * Returns the total number of tokens in the field.
   * @see Terms#getSumTotalTermFreq()
   */
  public long getNumberOfFieldTokens() {
    return numberOfFieldTokens;

I think that should be terms as well.

See:
https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/similarities/BasicStats.java#L65

And... this:

      numberOfFieldTokens = sumTotalTermFreq;

Here the code clearly starts with a term statistic and treats it as a token
count. As in the previous example, I think that should be terms as well.

See:
https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/similarities/SimilarityBase.java#L128
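To make the quantity behind that assignment concrete, here is a toy sketch in plain Java. It is not Lucene code: the whitespace "analyzer" and all names are hypothetical stand-ins. It only shows that summing the per-term frequencies over the distinct terms of a field gives the same number as counting the occurrences the analyzer emitted, so the statistic counts occurrences, whichever word is used for them.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy model only; TermStatsSketch is a hypothetical name, not a Lucene class.
class TermStatsSketch {

  // Hypothetical stand-in for an analyzer: lowercase, then split on whitespace.
  static List<String> analyze(String text) {
    String trimmed = text.trim();
    if (trimmed.isEmpty()) {
      return Collections.emptyList();
    }
    return Arrays.asList(trimmed.toLowerCase(Locale.ROOT).split("\\s+"));
  }

  // Frequency of each distinct string across all emitted occurrences.
  static Map<String, Long> termFreqs(List<String> occurrences) {
    Map<String, Long> freqs = new LinkedHashMap<>();
    for (String occurrence : occurrences) {
      freqs.merge(occurrence, 1L, Long::sum);
    }
    return freqs;
  }

  // Sum of the per-term frequencies, modeled on the name sumTotalTermFreq.
  static long sumTotalTermFreq(Map<String, Long> freqs) {
    long sum = 0;
    for (long freq : freqs.values()) {
      sum += freq;
    }
    return sum;
  }

  public static void main(String[] args) {
    List<String> occurrences = analyze("the quick fox jumps over the lazy fox");
    Map<String, Long> freqs = termFreqs(occurrences);
    System.out.println("occurrences:      " + occurrences.size());
    System.out.println("distinct terms:   " + freqs.size());
    System.out.println("sumTotalTermFreq: " + sumTotalTermFreq(freqs));
  }
}
```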

One last example:

   * Compute any collection-level weight (e.g. IDF, average document length, etc) needed for scoring a query.
   *
   * @param collectionStats collection-level statistics, such as the number of tokens in the collection.
   * @param termStats term-level statistics, such as the document frequency of a term across the collection.
   * @return SimWeight object with the information this Similarity needs to score a query.
   */
  public abstract SimWeight computeWeight(CollectionStatistics collectionStats, TermStatistics... termStats);

See:
https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/similarities/Similarity.java#L161

In fact, CollectionStatistics uses term, not token:

  /** returns the total number of tokens for this field
   * @see Terms#getSumTotalTermFreq() */
  public final long sumTotalTermFreq() {
    return sumTotalTermFreq;

Oops... it uses both, emphasizing my point about the confusion.

There are other examples as well.

My understanding is that tokens are merely a temporary, intermediate form
between the original raw source text for a field and the final terms to be
indexed (or the query terms from a parsed and analyzed query). Yes, within
TokenStream or the analyzer we speak of tokens, and intermediate string
values are referred to as tokens, but once the final string value is
retrieved from the TokenStream (analyzer), it's a term.
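A toy sketch of that reading, in plain Java rather than against the Lucene API (the stage names, the stop-word list, and the class name are all hypothetical): the strings that pass between stages play the role of tokens, and only what leaves the last stage would be indexed as terms.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy model only; the class and method names are hypothetical, not Lucene's API.
class AnalysisChainSketch {

  // Hypothetical stop-word list, chosen just for this example.
  static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("the", "a", "of"));

  // Stage 1, the tokenizer: split raw text on runs of non-letter, non-digit characters.
  static List<String> tokenize(String rawText) {
    List<String> tokens = new ArrayList<>();
    for (String piece : rawText.split("[^\\p{L}\\p{N}]+")) {
      if (!piece.isEmpty()) {
        tokens.add(piece);
      }
    }
    return tokens;
  }

  // Stage 2, the token filters: lowercase each token and drop stop words.
  static List<String> filter(List<String> tokens) {
    List<String> kept = new ArrayList<>();
    for (String token : tokens) {
      String lowered = token.toLowerCase(Locale.ROOT);
      if (!STOP_WORDS.contains(lowered)) {
        kept.add(lowered);
      }
    }
    return kept;
  }

  // Whatever leaves the last stage is what would be indexed as terms.
  static List<String> terms(String rawText) {
    return filter(tokenize(rawText));
  }

  public static void main(String[] args) {
    String raw = "The Art of Lucene";
    System.out.println("tokens: " + tokenize(raw));
    System.out.println("terms:  " + terms(raw));
  }
}
```

In this model a string such as "The" exists only as a token: it is produced by the first stage and never survives to become a term.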

In any case, is there any place in these cited examples (or in other
examples in this or related code) where "token" marks an important
distinction and "term" is not the proper... term... to be used?

Unless the Lucene project fully intends "token" and "term" to be
absolutely synonymous, a clear distinction should be drawn... I think. Or
at least the two words should be used consistently, which my last example
highlights.

Thanks.

-- Jack Krupansky
