[jira] [Commented] (LUCENE-8015) TestBasicModelIne.testRandomScoring failure

Robert Muir (JIRA) Tue, 31 Oct 2017 21:46:22 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-8015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16233648#comment-16233648 ]


Robert Muir commented on LUCENE-8015:
-------------------------------------

I dug into the I\(n) and I\(ne) failures here via the new test. Their biggest 
problem is in the BasicModel itself.

These idf-like functions fall into the "log1p" trap because of the formulas in 
use. Their formula is {{log2((maxDoc + 1) / (x + 0.5))}}, where x is docFreq 
for I\(n), expected docFreq for I\(ne), and totalTermFreq for I\(F). So the 
worst case (e.g. a term in every doc) gets even worse as collection size 
increases, because we take the log of values increasingly close to 1.
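
To make the trap concrete, here is a minimal sketch (not Lucene code) that evaluates the unfloored formula for a term present in every document, i.e. x = N. The class name {{IdfTrapDemo}} and its methods are hypothetical, and {{log2}} is reimplemented locally for illustration. The result shrinks toward 0 as N grows, which is the region where a plain log loses relative precision.

```java
// Hypothetical demo class, not part of Lucene.
class IdfTrapDemo {
  // Local base-2 logarithm helper (assumption: same definition as the log2 used in the formulas).
  static double log2(double x) {
    return Math.log(x) / Math.log(2);
  }

  // Unfloored idf-like factor: log2((N + 1) / (x + 0.5)).
  static double unfloored(long N, double x) {
    return log2((N + 1) / (x + 0.5));
  }

  public static void main(String[] args) {
    // Worst case: the term occurs in every document, so x == N.
    for (long N : new long[] {10L, 1_000L, 1_000_000L, 1_000_000_000L}) {
      System.out.println("N=" + N + " -> " + unfloored(N, N));
    }
  }
}
```

The printed values decrease monotonically toward zero because the argument of the log approaches 1 from above.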

BasicModel I\(F) never fails because we added a floor in its log: we had to, 
since it would otherwise go negative when totalTermFreq exceeds maxDoc, which 
can easily happen. But we should fix the other two in the same way, I think. It 
does not change retrieval quality in my tests.
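
As a rough illustration of why the floor matters, the sketch below compares the unfloored formula with a floored variant, assuming the floor has the same {{1 + ...}} form as the patch quoted in this comment. The class {{FlooredIdfDemo}} and its method names are hypothetical and not taken from Lucene.

```java
// Hypothetical demo class, not part of Lucene.
class FlooredIdfDemo {
  static double log2(double x) {
    return Math.log(x) / Math.log(2);
  }

  // Unfloored: goes negative once x + 0.5 exceeds N + 1.
  static double unfloored(long N, double x) {
    return log2((N + 1) / (x + 0.5));
  }

  // Floored (assumed form): the log argument is always greater than 1,
  // so the result is always positive.
  static double floored(long N, double x) {
    return log2(1 + (N + 1) / (x + 0.5));
  }

  public static void main(String[] args) {
    long N = 100;
    // totalTermFreq can exceed the number of documents.
    System.out.println("x=1000 unfloored: " + unfloored(N, 1000));
    System.out.println("x=1000 floored:   " + floored(N, 1000));
    // Term in every document: x == N.
    System.out.println("x=100  unfloored: " + unfloored(N, 100));
    System.out.println("x=100  floored:   " + floored(N, 100));
  }
}
```

With the floor, the log argument stays above 2 whenever x <= N, so the result stays above 1 instead of collapsing toward 0.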

If I floor them like this to avoid the issue, it fixes all their failures here, 
and they survive a hundred rounds of beasting by my new test:
{noformat}
--- a/lucene/core/src/java/org/apache/lucene/search/similarities/BasicModelIn.java
+++ b/lucene/core/src/java/org/apache/lucene/search/similarities/BasicModelIn.java
@@ -33,7 +33,7 @@ public class BasicModelIn extends BasicModel {
   public final double score(BasicStats stats, double tfn) {
     long N = stats.getNumberOfDocuments();
     long n = stats.getDocFreq();
-    return tfn * log2((N + 1) / (n + 0.5));
+    return tfn * log2(1 + (N + 1) / (n + 0.5));
   }
   
   @Override
--- a/lucene/core/src/java/org/apache/lucene/search/similarities/BasicModelIne.java
+++ b/lucene/core/src/java/org/apache/lucene/search/similarities/BasicModelIne.java
@@ -34,7 +34,7 @@ public class BasicModelIne extends BasicModel {
     long N = stats.getNumberOfDocuments();
     long F = stats.getTotalTermFreq();
     double ne = N * (1 - Math.pow((N - 1) / (double)N, F));
-    return tfn * log2((N + 1) / (ne + 0.5));
+    return tfn * log2(1 + (N + 1) / (ne + 0.5));
   }
{noformat}

Model G failures are separate; I have not looked into them yet.

> TestBasicModelIne.testRandomScoring failure
> -------------------------------------------
>
>                 Key: LUCENE-8015
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8015
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Adrien Grand
>            Priority: Major
>         Attachments: LUCENE-8015_test_fangs.patch
>
>
> reproduce with: ant test  -Dtestcase=TestBasicModelIne 
> -Dtests.method=testRandomScoring -Dtests.seed=86E85958B1183E93 
> -Dtests.slow=true -Dtests.locale=vi-VN -Dtests.timezone=Pacific/Tongatapu 
> -Dtests.asserts=true -Dtests.file.encoding=UTF8



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

