Re: Term vs. token

Jack Krupansky Wed, 20 Apr 2016 12:49:31 -0700

I gather that "term" is the proper technical term within the Vector Space
Model (TF-IDF) and BM25 similarity, so it may simply be a question of where
the boundary lies in Lucene between VSM processing and everything else, such
as the source for documents and queries.
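
To make the distinction concrete, here is a minimal sketch in plain Java. It
is not Lucene code: the class TermVsTokenSketch and the helpers tokenize and
indexField are hypothetical, and sumTotalTermFreq only borrows the name of
the Lucene statistic quoted below. It assumes that every token the analyzer
emits is indexed as exactly one term occurrence, in which case "number of
tokens in the field" and "sum of total term frequencies" are the same number.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class TermVsTokenSketch {

    // Analysis step: split raw text into lowercase tokens.
    // No field is involved at this point.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // Indexing step: each token becomes an occurrence of a term, identified
    // here by "field:text". The map value is that term's total frequency.
    static Map<String, Long> indexField(String field, List<String> docs) {
        Map<String, Long> totalTermFreq = new LinkedHashMap<>();
        for (String doc : docs) {
            for (String token : tokenize(doc)) {
                totalTermFreq.merge(field + ":" + token, 1L, Long::sum);
            }
        }
        return totalTermFreq;
    }

    // Sum of the total frequencies of all terms in the field.
    static long sumTotalTermFreq(Map<String, Long> totalTermFreq) {
        long sum = 0;
        for (long freq : totalTermFreq.values()) {
            sum += freq;
        }
        return sum;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("the quick fox", "the lazy dog and the fox");
        Map<String, Long> body = indexField("body", docs);
        System.out.println("unique terms: " + body.size());
        System.out.println("sumTotalTermFreq: " + sumTotalTermFreq(body));
    }
}
```

Under that assumption the two counts only diverge from the number of unique
terms, not from each other, which may be why the Javadoc uses both words for
the same statistic.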


-- Jack Krupansky

On Wed, Apr 20, 2016 at 1:51 PM, Ryan Josal <[email protected]> wrote:

> My understanding is that a Term is composed of a "token" and a field.  So
> the documentation makes sense to me - return the count of tokens in a field,
> for example.  But a couple of the references you cite don't match that
> definition, like the number of tokens in a collection.  Then again, maybe a
> Term doesn't hold a whole token, since a token also carries attributes such
> as a payload.  I guess I've convinced myself I'm not entirely clear about it
> either, but I do feel good about the concept that tokens don't have fields.
> You can tokenize a string without thinking about fields, and the tokens
> become terms with fields when you query.
>
> Ryan
>
>
> On Wednesday, April 20, 2016, Jack Krupansky <[email protected]>
> wrote:
>
>> Looking at the Lucene Similarity Javadoc, I see some references to
>> tokens, but I am wondering if that is intentional or whether those should
>> really be references to terms.
>>
>> For example:
>>
>>  *        <li><b>lengthNorm</b> - computed
>>  *        when the document is added to the index in accordance with the
>> number of tokens
>>  *        of this field in the document, so that shorter fields
>> contribute more to the score.
>>
>> I think that should be terms, not tokens.
>>
>> See:
>>
>> https://github.com/apache/lucene-solr/blob/releases/lucene-solr/5.5.0/lucene/core/src/java/org/apache/lucene/search/similarities/TFIDFSimilarity.java#L466
>>
>> And this:
>>
>>    * Returns the total number of tokens in the field.
>>    * @see Terms#getSumTotalTermFreq()
>>    */
>>   public long getNumberOfFieldTokens() {
>>     return numberOfFieldTokens;
>>
>> I think that should be terms as well:
>>
>> See:
>>
>> https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/similarities/BasicStats.java#L65
>>
>> And... this:
>>
>>       numberOfFieldTokens = sumTotalTermFreq;
>>
>> Where it is clearly starting with terms and treating them as tokens, but
>> as in the previous example, I think that should be terms as well.
>>
>> See:
>>
>> https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/similarities/SimilarityBase.java#L128
>>
>> One last example:
>>
>>    * Compute any collection-level weight (e.g. IDF, average document
>> length, etc) needed for scoring a query.
>>    *
>>    * @param collectionStats collection-level statistics, such as the
>> number of tokens in the collection.
>>    * @param termStats term-level statistics, such as the document
>> frequency of a term across the collection.
>>    * @return SimWeight object with the information this Similarity needs
>> to score a query.
>>    */
>>   public abstract SimWeight computeWeight(CollectionStatistics
>> collectionStats, TermStatistics... termStats);
>>
>> See:
>>
>> https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/similarities/Similarity.java#L161
>>
>> In fact, CollectionStatistics uses term, not token:
>>
>>   /** returns the total number of tokens for this field
>>    * @see Terms#getSumTotalTermFreq() */
>>   public final long sumTotalTermFreq() {
>>     return sumTotalTermFreq;
>>
>> Oops... it uses both, emphasizing my point about the confusion.
>>
>> There are other examples as well.
>>
>> My understanding is that tokens are merely a temporary, transitional form
>> between the original raw source text for a field and the final terms to be
>> indexed (or the query terms from a parsed and analyzed query). Yes, during
>> and within TokenStream or the analyzer we speak of tokens, and intermediate
>> string values are referred to as tokens, but once the final string value is
>> retrieved from the TokenStream (analyzer), it's a term.
>>
>> In any case, is there some distinction in any of these cited examples (or
>> other examples in this or related code) where "token" is an important
>> distinction to be made and "term" is not the proper... term... to be used?
>>
>> Unless the Lucene project fully intends that the terms token and term are
>> absolutely synonymous, a clear distinction should be drawn... I think. Or
>> at least the terms should be used consistently, which my last example
>> highlights.
>>
>> Thanks.
>>
>> -- Jack Krupansky
>>
>
