[jira] [Updated] (LUCENE-6711) Instead of docCount(), maxDoc() is used for numberOfDocuments in SimilarityBase

Ahmet Arslan (JIRA) Tue, 04 Aug 2015 04:24:56 -0700

     [ 
https://issues.apache.org/jira/browse/LUCENE-6711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Ahmet Arslan updated LUCENE-6711:
---------------------------------
    Attachment: LUCENE-6711.patch

Patch that includes the following migrate entry. However, I am not sure this 
is appropriate text for migrate.txt.
{panel:title=The way the number of documents is calculated has changed 
(LUCENE-6711)|borderStyle=dashed|borderColor=#ccc|titleBGColor=#F7D6C1|bgColor=#FFFFCE}
The number of documents (docCount) is used to calculate term specificity (idf) 
and average document length (avdl). Prior to LUCENE-6711, 
collectionStats.maxDoc() was used for these statistics. Now, 
collectionStats.docCount() is used whenever it is available; otherwise, 
maxDoc() is used.

Assume that a collection contains 100 documents, and 50 of them have a 
"keywords" field. In this example, maxDoc is 100 while docCount is 50 for the 
"keywords" field. The total number of tokens in the "keywords" field is divided 
by docCount to obtain avdl. Therefore, docCount, which is the number of 
documents that have at least one term for the field, is a more precise metric 
for optional fields.

DefaultSimilarity does not leverage avdl, so this change should have a 
relatively minor effect on the result list, because the relative idf values of 
terms remain the same. However, when idf is combined with other factors such as 
term frequency, the relative ranking of documents could change. Some Similarity 
implementations (such as BM25 and the ones instantiated with NormalizationH2) 
take avdl into account and would show a notable change in the ranked list, 
especially if you have a collection of documents with varying lengths, because 
NormalizationH2 tends to punish documents longer than avdl.
{panel}
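
To make the avdl example above concrete, here is a minimal, self-contained Java sketch. It is not code from the patch. The class and method names are hypothetical, the total of 500 tokens for the "keywords" field is an assumed figure (the entry does not give one), and the H2 formula is written as tfn = tf * log2(1 + c * avdl / len) with c = 1 chosen only for illustration.

```java
// Hypothetical illustration; does not depend on Lucene.
final class AvdlSketch {

    // Average document length: total tokens in the field divided by
    // the number of documents used as the denominator.
    static double avdl(long sumTotalTermFreq, long numberOfDocuments) {
        return (double) sumTotalTermFreq / numberOfDocuments;
    }

    static double log2(double x) {
        return Math.log(x) / Math.log(2);
    }

    // H2-style term frequency normalization (assumed form):
    // documents longer than avdl get a smaller normalized tf.
    static double h2(double tf, double len, double avdl, double c) {
        return tf * log2(1 + c * avdl / len);
    }
}
```

With the assumed 500 tokens, `AvdlSketch.avdl(500, 50)` gives 10.0 (docCount) while `AvdlSketch.avdl(500, 100)` gives 5.0 (maxDoc). A 10-token document with tf = 2 is then treated as average length in the first case and as twice the average in the second, so `h2(2, 10, 5, 1)` is smaller than `h2(2, 10, 10, 1)`.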

> Instead of docCount(), maxDoc() is used for numberOfDocuments in 
> SimilarityBase
> -------------------------------------------------------------------------------
>
>                 Key: LUCENE-6711
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6711
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/search
>            Reporter: Ahmet Arslan
>            Priority: Minor
>             Fix For: Trunk
>
>         Attachments: LUCENE-6711.patch, LUCENE-6711.patch, LUCENE-6711.patch, 
> LUCENE-6711.patch
>
>
> {{SimilarityBase.java}} has the following line :
> {code}
>  long numberOfDocuments = collectionStats.maxDoc();
> {code}
> It seems like {{collectionStats.docCount()}}, which returns the total number 
> of documents that have at least one term for this field, is a more 
> appropriate statistic here.
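
One way the suggested replacement could look, with a fallback to maxDoc() when docCount is not available, is sketched below. This is an illustration rather than the attached patch: the class and method names are hypothetical, the statistics are passed in as plain longs so the snippet runs without Lucene, and it assumes docCount is reported as -1 when the statistic is unavailable.

```java
// Hypothetical stand-in for the numberOfDocuments computation.
final class NumberOfDocumentsSketch {

    // Assumption: docCount == -1 means "not available" for this field.
    static long numberOfDocuments(long docCount, long maxDoc) {
        return docCount == -1 ? maxDoc : docCount;
    }
}
```

For example, `numberOfDocuments(50, 100)` yields 50, while `numberOfDocuments(-1, 100)` falls back to 100.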



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

