[ https://issues.apache.org/jira/browse/LUCENE-5200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13760296#comment-13760296 ]

ASF subversion and git services commented on LUCENE-5200:
---------------------------------------------------------

Commit 1520616 from [~rcmuir] in branch 'dev/branches/branch_4x'
[ https://svn.apache.org/r1520616 ]

LUCENE-5200: HighFreqTerms has confusing behavior with -t option
                
> HighFreqTerms has confusing behavior with -t option
> ---------------------------------------------------
>
>                 Key: LUCENE-5200
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5200
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/other
>            Reporter: Robert Muir
>         Attachments: LUCENE-5200.patch
>
>
> {code}
>  * <code>HighFreqTerms</code> class extracts the top n most frequent terms
>  * (by document frequency) from an existing Lucene index and reports their
>  * document frequency.
>  * <p>
>  * If the -t flag is given, both document frequency and total tf (total
>  * number of occurrences) are reported, ordered by descending total tf.
> {code}
> Problem #1:
> It's tricky what happens with -t: if you ask for the top-100 terms, it 
> requests the top-100 terms (by docFreq), then re-sorts the top-N by 
> totalTermFreq.
> So it's not really the top 100 most frequently occurring terms.
> Problem #2: 
> Using the -t option can be confusing and slow: the reported docFreq includes 
> deletions, but totalTermFreq does not (it actually walks postings lists if 
> there is even one deletion).
> I think this is a relic from 3.x days when Lucene did not support this 
> statistic. I think we should just always output both TermsEnum.docFreq() and 
> TermsEnum.totalTermFreq(), and -t just determines the comparator of the PQ.
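
To make Problem #1 and the proposed fix concrete, here is a minimal, self-contained Java sketch. It does not use Lucene and is not the committed patch: TermStats, topN, selectThenResort, sampleTerms and the sample numbers are all hypothetical, with the two long fields standing in for the values that TermsEnum.docFreq() and TermsEnum.totalTermFreq() would report. It only illustrates the difference between selecting by docFreq and then re-sorting, versus letting the chosen comparator drive the priority queue directly.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class TopTermsSketch {

    /** Hypothetical holder for per-term statistics; not a Lucene class. */
    static final class TermStats {
        final String term;
        final long docFreq;        // stand-in for TermsEnum.docFreq()
        final long totalTermFreq;  // stand-in for TermsEnum.totalTermFreq()

        TermStats(String term, long docFreq, long totalTermFreq) {
            this.term = term;
            this.docFreq = docFreq;
            this.totalTermFreq = totalTermFreq;
        }
    }

    static final Comparator<TermStats> BY_DOC_FREQ =
        Comparator.comparingLong((TermStats t) -> t.docFreq);
    static final Comparator<TermStats> BY_TOTAL_TERM_FREQ =
        Comparator.comparingLong((TermStats t) -> t.totalTermFreq);

    /** Proposed shape: the comparator alone decides which n terms are kept. */
    static List<TermStats> topN(List<TermStats> all, int n, Comparator<TermStats> cmp) {
        // Min-heap under cmp: the smallest kept entry sits on top.
        PriorityQueue<TermStats> pq = new PriorityQueue<>(cmp);
        for (TermStats t : all) {
            pq.offer(t);
            if (pq.size() > n) {
                pq.poll(); // evict the smallest so only the n largest remain
            }
        }
        List<TermStats> result = new ArrayList<>(pq);
        result.sort(cmp.reversed()); // report in descending order
        return result;
    }

    /** Behavior described in Problem #1: select by docFreq, then re-sort. */
    static List<TermStats> selectThenResort(List<TermStats> all, int n) {
        List<TermStats> selected = topN(all, n, BY_DOC_FREQ);
        selected.sort(BY_TOTAL_TERM_FREQ.reversed());
        return selected;
    }

    static List<String> names(List<TermStats> stats) {
        List<String> out = new ArrayList<>();
        for (TermStats t : stats) {
            out.add(t.term);
        }
        return out;
    }

    /** Made-up data: "c" occurs in few documents but very often overall. */
    static List<TermStats> sampleTerms() {
        return Arrays.asList(
            new TermStats("a", 10, 10),
            new TermStats("b", 9, 12),
            new TermStats("c", 2, 50));
    }

    public static void main(String[] args) {
        List<TermStats> terms = sampleTerms();
        // "c" is dropped during selection, so it can never be reported.
        System.out.println(names(selectThenResort(terms, 2)));             // [b, a]
        // With the comparator driving the queue, "c" is correctly first.
        System.out.println(names(topN(terms, 2, BY_TOTAL_TERM_FREQ)));     // [c, b]
    }
}
```

In this sketch both statistics are always carried on each entry, so either one can be printed regardless of which comparator was chosen; the flag would only pick between BY_DOC_FREQ and BY_TOTAL_TERM_FREQ.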

