[jira] [Commented] (LUCENE-6711) Instead of docCount(), maxDoc() is used for numberOfDocuments in SimilarityBase
[ https://issues.apache.org/jira/browse/LUCENE-6711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706980#comment-14706980 ]

Robert Muir commented on LUCENE-6711:
-------------------------------------

I don't think it's a bug with this. Likely the typical bugs from crappy useless querynorm, exposed by shaking things up.

Instead of docCount(), maxDoc() is used for numberOfDocuments in SimilarityBase
-------------------------------------------------------------------------------

Key: LUCENE-6711
URL: https://issues.apache.org/jira/browse/LUCENE-6711
Project: Lucene - Core
Issue Type: Bug
Components: core/search
Reporter: Ahmet Arslan
Assignee: Robert Muir
Priority: Minor
Fix For: Trunk
Attachments: LUCENE-6711.patch, LUCENE-6711.patch, LUCENE-6711.patch, LUCENE-6711.patch

{{SimilarityBase.java}} has the following line:
{code}
long numberOfDocuments = collectionStats.maxDoc();
{code}
It seems like {{collectionStats.docCount()}}, which returns the total number of documents that have at least one term for this field, is a more appropriate statistic here.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-6711) Instead of docCount(), maxDoc() is used for numberOfDocuments in SimilarityBase
[ https://issues.apache.org/jira/browse/LUCENE-6711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706970#comment-14706970 ]

Hoss Man commented on LUCENE-6711:
----------------------------------

possible bug identified by Terry Smith in LUCENE-6758
[jira] [Commented] (LUCENE-6711) Instead of docCount(), maxDoc() is used for numberOfDocuments in SimilarityBase
[ https://issues.apache.org/jira/browse/LUCENE-6711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705121#comment-14705121 ]

Robert Muir commented on LUCENE-6711:
-------------------------------------

Thanks. I think it found an unrelated bug.
[jira] [Commented] (LUCENE-6711) Instead of docCount(), maxDoc() is used for numberOfDocuments in SimilarityBase
[ https://issues.apache.org/jira/browse/LUCENE-6711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705090#comment-14705090 ]

Steve Rowe commented on LUCENE-6711:
------------------------------------

{{testNoFieldSkew()}} failure: https://issues.apache.org/jira/browse/LUCENE-6751
[jira] [Commented] (LUCENE-6711) Instead of docCount(), maxDoc() is used for numberOfDocuments in SimilarityBase
[ https://issues.apache.org/jira/browse/LUCENE-6711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705117#comment-14705117 ]

ASF subversion and git services commented on LUCENE-6711:
---------------------------------------------------------

Commit 1696807 from [~rcmuir] in branch 'dev/trunk'
[ https://svn.apache.org/r1696807 ]

LUCENE-6711: improve test when it fails
[jira] [Assigned] (LUCENE-6711) Instead of docCount(), maxDoc() is used for numberOfDocuments in SimilarityBase
[ https://issues.apache.org/jira/browse/LUCENE-6711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir reassigned LUCENE-6711:
----------------------------------

Assignee: Robert Muir
[jira] [Commented] (LUCENE-6711) Instead of docCount(), maxDoc() is used for numberOfDocuments in SimilarityBase
[ https://issues.apache.org/jira/browse/LUCENE-6711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14695631#comment-14695631 ]

ASF subversion and git services commented on LUCENE-6711:
---------------------------------------------------------

Commit 1695744 from [~rcmuir] in branch 'dev/trunk'
[ https://svn.apache.org/r1695744 ]

LUCENE-6711: Use CollectionStatistics.docCount() for IDF and average field length computations
[jira] [Commented] (LUCENE-6711) Instead of docCount(), maxDoc() is used for numberOfDocuments in SimilarityBase
[ https://issues.apache.org/jira/browse/LUCENE-6711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14695633#comment-14695633 ]

Robert Muir commented on LUCENE-6711:
-------------------------------------

Thanks Ahmet, I committed this.

I only made cosmetic changes: I renamed local variable and parameter names to be docCount because numDocs is pretty confusing. I also added a test case for all of our similarities.
[jira] [Resolved] (LUCENE-6711) Instead of docCount(), maxDoc() is used for numberOfDocuments in SimilarityBase
[ https://issues.apache.org/jira/browse/LUCENE-6711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir resolved LUCENE-6711.
---------------------------------

Resolution: Fixed
[jira] [Updated] (LUCENE-6711) Instead of docCount(), maxDoc() is used for numberOfDocuments in SimilarityBase
[ https://issues.apache.org/jira/browse/LUCENE-6711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ahmet Arslan updated LUCENE-6711:
--------------------------------

Attachment: LUCENE-6711.patch

Patch that includes the following migrate entry. However, I am not sure this is appropriate text for migrate.txt.

{panel:title=The way how number of document calculated is changed (LUCENE-6711)|borderStyle=dashed|borderColor=#ccc|titleBGColor=#F7D6C1|bgColor=#CE}
The number of documents (docCount) is used to calculate term specificity (idf) and average document length (avdl). Prior to LUCENE-6711, collectionStats.maxDoc() was used for these statistics. Now collectionStats.docCount() is used whenever possible; if it is not available, maxDoc() is used.

Assume that a collection contains 100 documents and 50 of them have a keywords field. In this example, maxDoc is 100 while docCount is 50 for the keywords field. The total number of tokens in the keywords field is divided by docCount to obtain avdl. Therefore docCount, which is the total number of documents that have at least one term for the field, is a more precise metric for optional fields.

DefaultSimilarity does not leverage avdl, so this change should have a relatively minor effect on the result list, because the relative idf values of terms remain the same. However, when combined with other factors such as term frequency, the relative ranking of documents could change. Some Similarity implementations (such as BM25 and the ones instantiated with NormalizationH2) take avdl into account and would show a notable change in the ranked list, especially if you have a collection of documents with varying lengths, because NormalizationH2 tends to punish documents longer than avdl.
{panel}
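The 100/50 example in the migrate entry above can be worked through numerically. The following is only a sketch in plain Java, not Lucene code: the class and method names are hypothetical, the token total (500) and document frequency (10) are assumed values chosen for illustration, and the idf shown is a textbook log(N/df) rather than the exact formula of any Lucene similarity.

```java
// Hypothetical illustration; does not use Lucene's API.
final class FieldStatsSketch {

    /** Average field length: total tokens in the field divided by the document count in use. */
    static double avgFieldLength(long sumTotalTermFreq, long numberOfDocuments) {
        return (double) sumTotalTermFreq / numberOfDocuments;
    }

    /** Textbook idf, log(N / df); Lucene's similarities use their own variants. */
    static double idf(long numberOfDocuments, long docFreq) {
        return Math.log((double) numberOfDocuments / docFreq);
    }

    public static void main(String[] args) {
        long maxDoc = 100;           // all documents in the collection
        long docCount = 50;          // documents with at least one term in "keywords"
        long sumTotalTermFreq = 500; // assumed total number of tokens in "keywords"
        long docFreq = 10;           // assumed number of documents containing a given term

        // Dividing by maxDoc counts the 50 documents that have no keywords field at all.
        System.out.println("avdl using maxDoc   = " + avgFieldLength(sumTotalTermFreq, maxDoc));
        System.out.println("avdl using docCount = " + avgFieldLength(sumTotalTermFreq, docCount));

        // The same term looks rarer when measured against maxDoc.
        System.out.println("idf using maxDoc    = " + idf(maxDoc, docFreq));
        System.out.println("idf using docCount  = " + idf(docCount, docFreq));
    }
}
```

With these assumed numbers, the average length doubles when the documents lacking the field are excluded, which is why length-normalizing similarities are the most affected.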
[jira] [Updated] (LUCENE-6711) Instead of docCount(), maxDoc() is used for numberOfDocuments in SimilarityBase
[ https://issues.apache.org/jira/browse/LUCENE-6711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ahmet Arslan updated LUCENE-6711:
--------------------------------

Attachment: LUCENE-6711.patch

Includes changes to TFIDF and BM25, {{ant clean test}} passes.
[jira] [Commented] (LUCENE-6711) Instead of docCount(), maxDoc() is used for numberOfDocuments in SimilarityBase
[ https://issues.apache.org/jira/browse/LUCENE-6711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14652160#comment-14652160 ]

Hoss Man commented on LUCENE-6711:
----------------------------------

bq. I also think it's best to just make the change for trunk and not do it in a minor version.

+1

This is the sort of behavior change that should be noted in MIGRATE.txt -- Ahmet: could you take a stab at adding the necessary text in your patch?
[jira] [Updated] (LUCENE-6711) Instead of docCount(), maxDoc() is used for numberOfDocuments in SimilarityBase
[ https://issues.apache.org/jira/browse/LUCENE-6711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hoss Man updated LUCENE-6711:
----------------------------

Affects Version/s: (was: 5.2.1)
Fix Version/s: (was: 5.3)
               Trunk
[jira] [Commented] (LUCENE-6711) Instead of docCount(), maxDoc() is used for numberOfDocuments in SimilarityBase
[ https://issues.apache.org/jira/browse/LUCENE-6711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14651839#comment-14651839 ]

Robert Muir commented on LUCENE-6711:
-------------------------------------

Yes, that's right.
[jira] [Updated] (LUCENE-6711) Instead of docCount(), maxDoc() is used for numberOfDocuments in SimilarityBase
[ https://issues.apache.org/jira/browse/LUCENE-6711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ahmet Arslan updated LUCENE-6711:
--------------------------------

Attachment: LUCENE-6711.patch

This patch checks for -1 and uses maxDoc() if docCount() is not supported.
[jira] [Commented] (LUCENE-6711) Instead of docCount(), maxDoc() is used for numberOfDocuments in SimilarityBase
[ https://issues.apache.org/jira/browse/LUCENE-6711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14651758#comment-14651758 ]

Ahmet Arslan commented on LUCENE-6711:
--------------------------------------

bq. We should fix TFIDFSimilarity and BM25Similarity too.

For TFIDF and BM25, do we simply replace
{code}collectionStats.maxDoc(){code}
with
{code}collectionStats.docCount() == -1 ? collectionStats.maxDoc() : collectionStats.docCount(){code}
?
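The fallback asked about above could be factored into a small helper so the -1 check is written once. This is a sketch under assumptions, not the committed patch: `DocCountFallback` and `effectiveDocCount` are hypothetical names, and the method takes plain long values instead of a CollectionStatistics instance so that it runs without Lucene on the classpath. The -1 sentinel for an unsupported statistic is taken from the discussion in this issue.

```java
// Hypothetical helper; not part of Lucene.
final class DocCountFallback {

    /**
     * Returns the per-field docCount when it is available, otherwise maxDoc.
     * A docCount of -1 is treated as "statistic not supported", as discussed in this issue.
     */
    static long effectiveDocCount(long docCount, long maxDoc) {
        return docCount == -1 ? maxDoc : docCount;
    }

    public static void main(String[] args) {
        // Statistic supported: the field-level count wins.
        System.out.println(effectiveDocCount(50, 100));
        // Statistic unsupported: fall back to the collection-wide maxDoc.
        System.out.println(effectiveDocCount(-1, 100));
    }
}
```

In real code the two arguments would presumably come from collectionStats.docCount() and collectionStats.maxDoc().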
Re: numberOfDocuments in SimilarityBase
Hi Robert,

Thanks for chiming in, I created LUCENE-6711 for this.

Ahmet
[jira] [Updated] (LUCENE-6711) Instead of docCount(), maxDoc() is used for numberOfDocuments in SimilarityBase
[ https://issues.apache.org/jira/browse/LUCENE-6711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ahmet Arslan updated LUCENE-6711:
--------------------------------

Attachment: LUCENE-6711.patch

Patch that includes suggested change. However, this breaks most of the tests in {{TestSimilarityBase}}. What is the preferred course of action here?
[jira] [Created] (LUCENE-6711) Instead of docCount(), maxDoc() is used for numberOfDocuments in SimilarityBase
Ahmet Arslan created LUCENE-6711:
------------------------------------

Summary: Instead of docCount(), maxDoc() is used for numberOfDocuments in SimilarityBase
Key: LUCENE-6711
URL: https://issues.apache.org/jira/browse/LUCENE-6711
Project: Lucene - Core
Issue Type: Bug
Components: core/search
Affects Versions: 5.2.1
Reporter: Ahmet Arslan
Priority: Minor
Fix For: 5.3

{{SimilarityBase.java}} has the following line:
{code}
long numberOfDocuments = collectionStats.maxDoc();
{code}
It seems like {{collectionStats.docCount()}}, which returns the total number of documents that have at least one term for this field, is a more appropriate statistic here.
[jira] [Commented] (LUCENE-6711) Instead of docCount(), maxDoc() is used for numberOfDocuments in SimilarityBase
[ https://issues.apache.org/jira/browse/LUCENE-6711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14650499#comment-14650499 ]

Robert Muir commented on LUCENE-6711:
-------------------------------------

IndexReader/Terms etc. still document this as an optional statistic: I think we should keep it that way. E.g., maybe it's hard to compute for some FilterReader, who knows. So I think we should do a fallback like the other statistics: check for -1 and use maxDoc if it's unsupported.

But I think it's a good time to make the change. For ordinary users, it will not be trappy/happen incrementally: all these statistics have been supported since 4.0.

We should fix TFIDFSimilarity and BM25Similarity too.
[jira] [Commented] (LUCENE-6711) Instead of docCount(), maxDoc() is used for numberOfDocuments in SimilarityBase
[ https://issues.apache.org/jira/browse/LUCENE-6711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14650510#comment-14650510 ]

Robert Muir commented on LUCENE-6711:
-------------------------------------

This is already a pluggable API, so someone can do that if they want: let's not make our code complicated.

I also think it's best to just make the change for trunk and not do it in a minor version.
[jira] [Commented] (LUCENE-6711) Instead of docCount(), maxDoc() is used for numberOfDocuments in SimilarityBase
[ https://issues.apache.org/jira/browse/LUCENE-6711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14650500#comment-14650500 ]

Upayavira commented on LUCENE-6711:
-----------------------------------

I've often wondered the same sort of thing.

Now, given that this will likely change the score for every single query anyone does on any Lucene based search, would it be possible to make this configurable, so that people can choose which one they want? More particularly, to choose the point at which their scoring will change?
Re: numberOfDocuments in SimilarityBase
I think so.

When adding this statistic (Lucene 4.0), personally I really wanted to fix it everywhere. But we had the problem of backwards compatibility, and it's bad to use different formulas for different segments even if it works...

Nowadays we don't have Lucene 3 segments around anymore, so I think we should fix this. Want to open an issue?

On Wed, Jul 29, 2015 at 10:45 AM, Ahmet Arslan iori...@yahoo.com.invalid wrote:
Hello List, SimilarityBase uses CollectionStatistics#maxDoc() for numberOfDocuments. Shouldn't it be field-based CollectionStatistics#docCount()?
numberOfDocuments in SimilarityBase
Hello List,

SimilarityBase uses CollectionStatistics#maxDoc() for numberOfDocuments.
Shouldn't it be field-based CollectionStatistics#docCount()?

--- core/src/java/org/apache/lucene/search/similarities/SimilarityBase.java (revision 1693268)
+++ core/src/java/org/apache/lucene/search/similarities/SimilarityBase.java (working copy)
@@ -102,7 +102,7 @@
   protected void fillBasicStats(BasicStats stats, CollectionStatistics collectionStats, TermStatistics termStats) {
     // #positions(field) must be >= #positions(term)
     assert collectionStats.sumTotalTermFreq() == -1 || collectionStats.sumTotalTermFreq() >= termStats.totalTermFreq();
-    long numberOfDocuments = collectionStats.maxDoc();
+    long numberOfDocuments = collectionStats.docCount();

Thanks,
Ahmet