Robert Muir created LUCENE-5200:
-----------------------------------

             Summary: HighFreqTerms has confusing behavior with -t option
                 Key: LUCENE-5200
                 URL: https://issues.apache.org/jira/browse/LUCENE-5200
             Project: Lucene - Core
          Issue Type: Bug
          Components: modules/other
            Reporter: Robert Muir


{code}
 * <code>HighFreqTerms</code> class extracts the top n most frequent terms
 * (by document frequency) from an existing Lucene index and reports their
 * document frequency.
 * <p>
 * If the -t flag is given, both document frequency and total tf (total
 * number of occurrences) are reported, ordered by descending total tf.
{code}

Problem #1:
It's tricky what happens with -t: if you ask for the top-100 terms, it requests 
the top-100 terms (by docFreq), then re-sorts those top-N by totalTermFreq.

So it's not really the top 100 most frequently occurring terms: a term that 
occurs many times in only a few documents never makes the docFreq cut.
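
A toy illustration of Problem #1, using plain Java collections rather than the Lucene API. The TermStat class, the method names, and all of the numbers are invented for this example; only the two-step selection (pick by docFreq, then re-sort by totalTermFreq) mirrors the behavior described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class TopNMismatch {
    /** Hypothetical holder for per-term statistics (not a Lucene class). */
    static final class TermStat {
        final String term;
        final long docFreq;
        final long totalTermFreq;

        TermStat(String term, long docFreq, long totalTermFreq) {
            this.term = term;
            this.docFreq = docFreq;
            this.totalTermFreq = totalTermFreq;
        }
    }

    // Both comparators order from highest to lowest.
    static final Comparator<TermStat> BY_DOC_FREQ =
        Comparator.comparingLong((TermStat t) -> t.docFreq).reversed();
    static final Comparator<TermStat> BY_TOTAL_TF =
        Comparator.comparingLong((TermStat t) -> t.totalTermFreq).reversed();

    /** Made-up statistics: "lucene" is rare by document but frequent by occurrence. */
    static List<TermStat> sample() {
        return Arrays.asList(
            new TermStat("the", 90, 300),
            new TermStat("and", 80, 120),
            new TermStat("lucene", 10, 500));
    }

    /** Returns a mutable copy of the first n entries under the given ordering. */
    static List<TermStat> top(List<TermStat> stats, int n, Comparator<TermStat> order) {
        List<TermStat> sorted = new ArrayList<>(stats);
        sorted.sort(order);
        return new ArrayList<>(sorted.subList(0, Math.min(n, sorted.size())));
    }

    static List<String> names(List<TermStat> stats) {
        List<String> out = new ArrayList<>();
        for (TermStat t : stats) {
            out.add(t.term);
        }
        return out;
    }

    /** Behavior described in the issue: select by docFreq, then re-sort the survivors. */
    static List<String> selectByDocFreqThenResort(List<TermStat> stats, int n) {
        List<TermStat> selected = top(stats, n, BY_DOC_FREQ);
        selected.sort(BY_TOTAL_TF);
        return names(selected);
    }

    /** What a user asking for "top n by total tf" would probably expect. */
    static List<String> trueTopByTotalTermFreq(List<TermStat> stats, int n) {
        return names(top(stats, n, BY_TOTAL_TF));
    }

    public static void main(String[] args) {
        System.out.println(selectByDocFreqThenResort(sample(), 2)); // [the, and]
        System.out.println(trueTopByTotalTermFreq(sample(), 2));    // [lucene, the]
    }
}
```

With these numbers, "lucene" has the highest totalTermFreq but is dropped in the first step because its docFreq is low.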

Problem #2: 
Using the -t option can be confusing and slow: the reported docFreq still 
counts deleted documents, but totalTermFreq does not (it actually walks the 
postings lists if the index has even one deletion).

I think this is a relic from the 3.x days, when Lucene did not support this 
statistic. I think we should just always output both TermsEnum.docFreq() and 
TermsEnum.totalTermFreq(), and let -t determine only the comparator of the PQ.
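
A minimal sketch of that proposal, assuming a bounded java.util.PriorityQueue in place of Lucene's own PriorityQueue class. TermStat, highFreqTerms, and the byTotalTermFreq flag are hypothetical names for illustration, and the sample numbers are made up. In real code the two statistics would be read from TermsEnum.docFreq() and TermsEnum.totalTermFreq() for each term.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class HighFreqSketch {
    /** Hypothetical holder; both statistics are always kept and reported. */
    static final class TermStat {
        final String term;
        final long docFreq;
        final long totalTermFreq;

        TermStat(String term, long docFreq, long totalTermFreq) {
            this.term = term;
            this.docFreq = docFreq;
            this.totalTermFreq = totalTermFreq;
        }

        @Override
        public String toString() {
            return term + " docFreq=" + docFreq + " totalTermFreq=" + totalTermFreq;
        }
    }

    /** Made-up statistics for the usage example. */
    static List<TermStat> sample() {
        return Arrays.asList(
            new TermStat("the", 90, 300),
            new TermStat("and", 80, 120),
            new TermStat("lucene", 10, 500));
    }

    /**
     * Keeps the n best terms in a min-heap. The flag (standing in for -t)
     * changes only which statistic the heap compares; n must be >= 0.
     */
    static List<TermStat> highFreqTerms(Iterable<TermStat> terms, int n, boolean byTotalTermFreq) {
        Comparator<TermStat> cmp = byTotalTermFreq
            ? Comparator.comparingLong((TermStat t) -> t.totalTermFreq)
            : Comparator.comparingLong((TermStat t) -> t.docFreq);

        // Ascending order puts the weakest retained term at the head of the heap.
        PriorityQueue<TermStat> pq = new PriorityQueue<>(n + 1, cmp);
        for (TermStat t : terms) {
            pq.add(t);
            if (pq.size() > n) {
                pq.poll(); // evict the current weakest term
            }
        }

        List<TermStat> result = new ArrayList<>(pq);
        result.sort(cmp.reversed()); // report from highest to lowest
        return result;
    }

    public static void main(String[] args) {
        for (TermStat t : highFreqTerms(sample(), 2, true)) {
            System.out.println(t);
        }
    }
}
```

Because the queue is ordered by the selected statistic from the start, the result is the actual top n for that statistic, and no separate re-sort of a docFreq-based selection is needed.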


