[jira] [Commented] (LUCENE-6711) Instead of docCount(), maxDoc() is used for numberOfDocuments in SimilarityBase

2015-08-21 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706980#comment-14706980
 ] 

Robert Muir commented on LUCENE-6711:
-

I don't think it's a bug with this.

Likely the typical bugs from crappy useless queryNorm, exposed by shaking 
things up.

 Instead of docCount(), maxDoc() is used for numberOfDocuments in 
 SimilarityBase
 ---

 Key: LUCENE-6711
 URL: https://issues.apache.org/jira/browse/LUCENE-6711
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/search
Reporter: Ahmet Arslan
Assignee: Robert Muir
Priority: Minor
 Fix For: Trunk

 Attachments: LUCENE-6711.patch, LUCENE-6711.patch, LUCENE-6711.patch, 
 LUCENE-6711.patch


 {{SimilarityBase.java}} has the following line :
 {code}
  long numberOfDocuments = collectionStats.maxDoc();
 {code}
 It seems like {{collectionStats.docCount()}}, which returns the total number 
 of documents that have at least one term for this field, is a more appropriate 
 statistic here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6711) Instead of docCount(), maxDoc() is used for numberOfDocuments in SimilarityBase

2015-08-21 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706970#comment-14706970
 ] 

Hoss Man commented on LUCENE-6711:
--

possible bug identified by Terry Smith in LUCENE-6758




[jira] [Commented] (LUCENE-6711) Instead of docCount(), maxDoc() is used for numberOfDocuments in SimilarityBase

2015-08-20 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705121#comment-14705121
 ] 

Robert Muir commented on LUCENE-6711:
-

Thanks. I think it found an unrelated bug.




[jira] [Commented] (LUCENE-6711) Instead of docCount(), maxDoc() is used for numberOfDocuments in SimilarityBase

2015-08-20 Thread Steve Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705090#comment-14705090
 ] 

Steve Rowe commented on LUCENE-6711:


{{testNoFieldSkew()}} failure: https://issues.apache.org/jira/browse/LUCENE-6751




[jira] [Commented] (LUCENE-6711) Instead of docCount(), maxDoc() is used for numberOfDocuments in SimilarityBase

2015-08-20 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705117#comment-14705117
 ] 

ASF subversion and git services commented on LUCENE-6711:
-

Commit 1696807 from [~rcmuir] in branch 'dev/trunk'
[ https://svn.apache.org/r1696807 ]

LUCENE-6711: improve test when it fails




[jira] [Assigned] (LUCENE-6711) Instead of docCount(), maxDoc() is used for numberOfDocuments in SimilarityBase

2015-08-13 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir reassigned LUCENE-6711:
---

Assignee: Robert Muir




[jira] [Commented] (LUCENE-6711) Instead of docCount(), maxDoc() is used for numberOfDocuments in SimilarityBase

2015-08-13 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14695631#comment-14695631
 ] 

ASF subversion and git services commented on LUCENE-6711:
-

Commit 1695744 from [~rcmuir] in branch 'dev/trunk'
[ https://svn.apache.org/r1695744 ]

LUCENE-6711: Use CollectionStatistics.docCount() for IDF and average field 
length computations




[jira] [Commented] (LUCENE-6711) Instead of docCount(), maxDoc() is used for numberOfDocuments in SimilarityBase

2015-08-13 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14695633#comment-14695633
 ] 

Robert Muir commented on LUCENE-6711:
-

Thanks Ahmet, I committed this. I only made cosmetic changes: I renamed local 
variable and parameter names to be docCount because numDocs is pretty 
confusing. I also added a test case for all of our similarities.




[jira] [Resolved] (LUCENE-6711) Instead of docCount(), maxDoc() is used for numberOfDocuments in SimilarityBase

2015-08-13 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir resolved LUCENE-6711.
-
Resolution: Fixed




[jira] [Updated] (LUCENE-6711) Instead of docCount(), maxDoc() is used for numberOfDocuments in SimilarityBase

2015-08-04 Thread Ahmet Arslan (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahmet Arslan updated LUCENE-6711:
-
Attachment: LUCENE-6711.patch

This patch includes the following migration entry, but I am not sure this is 
appropriate text for MIGRATE.txt.
{panel:title=The way the number of documents is calculated has changed 
(LUCENE-6711)|borderStyle=dashed|borderColor=#ccc|titleBGColor=#F7D6C1|bgColor=#CE}
The number of documents (docCount) is used to calculate term specificity (idf) 
and average document length (avdl). Prior to LUCENE-6711, 
collectionStats.maxDoc() was used for these statistics. Now, 
collectionStats.docCount() is used whenever possible; if it is not available, 
maxDoc() is used.

Assume that a collection contains 100 documents, and 50 of them have a keywords 
field. In this example, maxDoc is 100 while docCount is 50 for the keywords 
field. The total number of tokens for the keywords field is divided by docCount 
to obtain avdl. Therefore docCount, which is the total number of documents that 
have at least one term for the field, is a more precise metric for optional 
fields.

DefaultSimilarity does not leverage avdl, so this change should have a 
relatively minor effect on the result list, because the relative idf values of 
terms remain the same. However, when combined with other factors such as term 
frequency, the relative ranking of documents could change. Some Similarity 
implementations (such as the ones instantiated with NormalizationH2, and BM25) 
take avdl into account and would show a notable change in the ranked list, 
especially if you have a collection of documents with varying lengths, because 
NormalizationH2 tends to punish documents longer than avdl.
{panel}
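
The 100-document example in the panel can be worked through numerically. The following is a minimal sketch in plain Java, not code from the patch: the class and method names are hypothetical, the token count (500) and document frequency (9) are assumed values chosen for illustration, and the idf shown is the classic TF-IDF style formula 1 + ln(N / (df + 1)).

```java
final class DocCountExample {
    /** Average field length: total tokens in the field divided by the number of documents. */
    static double averageLength(long sumTotalTermFreq, long numberOfDocuments) {
        return (double) sumTotalTermFreq / numberOfDocuments;
    }

    /** Classic TF-IDF style idf, 1 + ln(N / (df + 1)); used here only for illustration. */
    static double idf(long docFreq, long numberOfDocuments) {
        return 1.0 + Math.log((double) numberOfDocuments / (docFreq + 1));
    }

    public static void main(String[] args) {
        long maxDoc = 100;      // all documents in the collection
        long docCount = 50;     // documents that have the keywords field
        long totalTokens = 500; // assumed token count for the keywords field
        long docFreq = 9;       // assumed document frequency of some term

        System.out.println("avdl with maxDoc:   " + averageLength(totalTokens, maxDoc));
        System.out.println("avdl with docCount: " + averageLength(totalTokens, docCount));
        System.out.println("idf with maxDoc:    " + idf(docFreq, maxDoc));
        System.out.println("idf with docCount:  " + idf(docFreq, docCount));
    }
}
```

With these assumed numbers, avdl doubles from 5.0 to 10.0 when docCount replaces maxDoc. For the idf, ln(N / (df + 1)) = ln N - ln(df + 1), so changing N shifts the idf of every term in the field by the same constant, which is why the relative idf values are described as unchanged.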




[jira] [Updated] (LUCENE-6711) Instead of docCount(), maxDoc() is used for numberOfDocuments in SimilarityBase

2015-08-03 Thread Ahmet Arslan (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahmet Arslan updated LUCENE-6711:
-
Attachment: LUCENE-6711.patch

Includes changes to TFIDF and BM25, {{ant clean test}} passes.




[jira] [Commented] (LUCENE-6711) Instead of docCount(), maxDoc() is used for numberOfDocuments in SimilarityBase

2015-08-03 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14652160#comment-14652160
 ] 

Hoss Man commented on LUCENE-6711:
--

bq. I also think it's best to just make the change for trunk and not do it in a 
minor version.

+1

This is the sort of behavior change that should be noted in MIGRATE.txt -- 
Ahmet: could you take a stab at adding the necessary text in your patch?




[jira] [Updated] (LUCENE-6711) Instead of docCount(), maxDoc() is used for numberOfDocuments in SimilarityBase

2015-08-03 Thread Hoss Man (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man updated LUCENE-6711:
-
Affects Version/s: (was: 5.2.1)
Fix Version/s: (was: 5.3)
   Trunk




[jira] [Commented] (LUCENE-6711) Instead of docCount(), maxDoc() is used for numberOfDocuments in SimilarityBase

2015-08-03 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14651839#comment-14651839
 ] 

Robert Muir commented on LUCENE-6711:
-

Yes, that's right.




[jira] [Updated] (LUCENE-6711) Instead of docCount(), maxDoc() is used for numberOfDocuments in SimilarityBase

2015-08-03 Thread Ahmet Arslan (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahmet Arslan updated LUCENE-6711:
-
Attachment: LUCENE-6711.patch

This patch checks for -1 and uses maxDoc() if docCount() is unsupported.




[jira] [Commented] (LUCENE-6711) Instead of docCount(), maxDoc() is used for numberOfDocuments in SimilarityBase

2015-08-03 Thread Ahmet Arslan (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14651758#comment-14651758
 ] 

Ahmet Arslan commented on LUCENE-6711:
--

bq. We should fix TFIDFSimilarity and BM25Similarity too.

For TFIDF and BM25, do we simply replace {code}collectionStats.maxDoc(){code} 
with {code}collectionStats.docCount() == -1 ? collectionStats.maxDoc() : 
collectionStats.docCount(){code} ?
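
The fallback asked about here can be sketched as a small helper. This is a minimal sketch in plain Java, not the committed patch: `DocCountFallback` and `effectiveDocCount` are hypothetical names, and the two `long` parameters stand in for the values returned by `collectionStats.docCount()` and `collectionStats.maxDoc()`, with -1 meaning the statistic is unsupported.

```java
final class DocCountFallback {
    private DocCountFallback() {}

    /**
     * Returns docCount when the field-level statistic is available,
     * otherwise falls back to maxDoc. A docCount of -1 is treated as
     * "unsupported", mirroring the check proposed in the comment.
     */
    static long effectiveDocCount(long docCount, long maxDoc) {
        return docCount == -1 ? maxDoc : docCount;
    }

    public static void main(String[] args) {
        // Field statistic available: the field-level count is used.
        System.out.println(effectiveDocCount(50, 100));
        // Field statistic unsupported: fall back to the collection-wide count.
        System.out.println(effectiveDocCount(-1, 100));
    }
}
```

Keeping the check in one place would let each similarity use the same rule, though whether the actual patch factors it out this way is not stated in the thread.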




Re: numberOfDocuments in SimilarityBase

2015-08-01 Thread Ahmet Arslan
Hi Robert,

Thanks for chiming in, I created LUCENE-6711 for this.

Ahmet





[jira] [Updated] (LUCENE-6711) Instead of docCount(), maxDoc() is used for numberOfDocuments in SimilarityBase

2015-08-01 Thread Ahmet Arslan (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahmet Arslan updated LUCENE-6711:
-
Attachment: LUCENE-6711.patch

Patch that includes suggested change. However, this breaks most of the tests in 
{{TestSimilarityBase}}. What is the preferred course of action here?  




[jira] [Created] (LUCENE-6711) Instead of docCount(), maxDoc() is used for numberOfDocuments in SimilarityBase

2015-08-01 Thread Ahmet Arslan (JIRA)
Ahmet Arslan created LUCENE-6711:


 Summary: Instead of docCount(), maxDoc() is used for 
numberOfDocuments in SimilarityBase
 Key: LUCENE-6711
 URL: https://issues.apache.org/jira/browse/LUCENE-6711
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/search
Affects Versions: 5.2.1
Reporter: Ahmet Arslan
Priority: Minor
 Fix For: 5.3


{{SimilarityBase.java}} has the following line :
{code}
 long numberOfDocuments = collectionStats.maxDoc();
{code}

It seems like {{collectionStats.docCount()}}, which returns the total number of 
documents that have at least one term for this field, is a more appropriate 
statistic here.
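
To make the difference between the two statistics concrete, here is a minimal, self-contained sketch in plain Java. It does not use Lucene: the `ToyCollection` class and its methods are hypothetical stand-ins that only mirror the meaning of `maxDoc()`, `docCount()` and `sumTotalTermFreq()` as described in this issue.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for an index; not a Lucene class. */
final class ToyCollection {
    // Each document maps a field name to the tokens indexed for that field.
    private final List<Map<String, List<String>>> docs;

    ToyCollection(List<Map<String, List<String>>> docs) {
        this.docs = docs;
    }

    /** Total number of documents, regardless of which fields they have. */
    long maxDoc() {
        return docs.size();
    }

    /** Number of documents that have at least one term for the field. */
    long docCount(String field) {
        return docs.stream()
                .filter(d -> !d.getOrDefault(field, List.of()).isEmpty())
                .count();
    }

    /** Total number of tokens indexed for the field across all documents. */
    long sumTotalTermFreq(String field) {
        return docs.stream()
                .mapToLong(d -> d.getOrDefault(field, List.of()).size())
                .sum();
    }

    public static void main(String[] args) {
        ToyCollection c = new ToyCollection(List.of(
                Map.of("title", List.of("lucene", "scoring"), "keywords", List.of("idf", "bm25")),
                Map.of("title", List.of("similarity")),
                Map.of("title", List.of("statistics"), "keywords", List.of("avdl")),
                Map.of("title", List.of("fields"))));
        System.out.println("maxDoc = " + c.maxDoc());
        System.out.println("docCount(keywords) = " + c.docCount("keywords"));
        System.out.println("sumTotalTermFreq(keywords) = " + c.sumTotalTermFreq("keywords"));
    }
}
```

In this toy collection only two of the four documents have a keywords field, so a per-field average length computed with maxDoc would be diluted by documents that never contained the field.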






[jira] [Commented] (LUCENE-6711) Instead of docCount(), maxDoc() is used for numberOfDocuments in SimilarityBase

2015-08-01 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14650499#comment-14650499
 ] 

Robert Muir commented on LUCENE-6711:
-

IndexReader/Terms etc. still document this as an optional statistic: I think we 
should keep it that way. E.g. maybe it's hard to compute for some FilterReader, 
who knows.

So I think we should do a fallback like the other statistics: check for -1 and 
use maxDoc if it's unsupported.

But I think it's a good time to make the change. For ordinary users, it will not 
be trappy/happen incrementally: all these statistics have been supported since 
4.0. We should fix TFIDFSimilarity and BM25Similarity too.




[jira] [Commented] (LUCENE-6711) Instead of docCount(), maxDoc() is used for numberOfDocuments in SimilarityBase

2015-08-01 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14650510#comment-14650510
 ] 

Robert Muir commented on LUCENE-6711:
-

This is already a pluggable API so someone can do that if they want: let's not 
make our code complicated. I also think it's best to just make the change for 
trunk and not do it in a minor version.




[jira] [Commented] (LUCENE-6711) Instead of docCount(), maxDoc() is used for numberOfDocuments in SimilarityBase

2015-08-01 Thread Upayavira (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14650500#comment-14650500
 ] 

Upayavira commented on LUCENE-6711:
---

I've often wondered the same sort of thing.

Now, given that this will likely change the score for every single query anyone 
does on any Lucene based search, would it be possible to make this 
configurable, so that people can choose which one they want? More particularly, 
to choose the point at which their scoring will change?




Re: numberOfDocuments in SimilarityBase

2015-07-30 Thread Robert Muir
I think so. When adding this statistic (Lucene 4.0), personally I
really wanted to fix it everywhere. But we had the problem of
backwards compatibility, and it's bad to use different formulas for
different segments even if it works...

Nowadays we don't have Lucene 3 segments around anymore, so I think we
should fix this. Want to open an issue?




numberOfDocuments in SimilarityBase

2015-07-29 Thread Ahmet Arslan
Hello List,

SimilarityBase uses CollectionStatistics#maxDoc() for numberOfDocuments.
Shouldn't it be field-based CollectionStatistics#docCount()?

--- core/src/java/org/apache/lucene/search/similarities/SimilarityBase.java (revision 1693268)
+++ core/src/java/org/apache/lucene/search/similarities/SimilarityBase.java (working copy)
@@ -102,7 +102,7 @@
   protected void fillBasicStats(BasicStats stats, CollectionStatistics collectionStats, TermStatistics termStats) {
     // #positions(field) must be >= #positions(term)
     assert collectionStats.sumTotalTermFreq() == -1 || collectionStats.sumTotalTermFreq() >= termStats.totalTermFreq();
-    long numberOfDocuments = collectionStats.maxDoc();
+    long numberOfDocuments = collectionStats.docCount();


Thanks,
Ahmet
