[jira] [Updated] (LUCENE-7997) More sanity testing of similarities

2017-10-24 Thread Robert Muir (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-7997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-7997:

Attachment: LUCENE-7997.patch

Updated patch. I hooked in CheckHits for more explain testing, and the tests 
now also cover a nearby norm and a nearby, slightly rarer term to ensure 
relevance doesn't go backwards in those cases either.

I updated the AwaitsFix URL to point to a separate issue for fixing sims with 
bugs or moving them to sandbox: LUCENE-8010

Finally, I optimized the tests to have a more reasonable runtime. I think it's 
ready for now.


> More sanity testing of similarities
> ---
>
> Key: LUCENE-7997
> URL: https://issues.apache.org/jira/browse/LUCENE-7997
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: LUCENE-7997.patch, LUCENE-7997_wip.patch, 
> LUCENE-7997_wip.patch, LUCENE-7997_wip.patch, LUCENE-7997_wip.patch, 
> LUCENE-7997_wip.patch, LUCENE-7997_wip.patch, LUCENE-7997_wip.patch, 
> LUCENE-7997_wip.patch, LUCENE-7997_wip.patch, LUCENE-7997_wip.patch, 
> LUCENE-7997_wip.patch, LUCENE-7997_wip.patch
>
>
> LUCENE-7993 is a potential optimization that we could only apply if the 
> similarity is an increasing functions of {{freq}} (all other things like DF 
> and length being equal). This sounds like a very reasonable requirement for a 
> similarity, so we should test it in the base similarity test case and maybe 
> move broken similarities to sandbox?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Updated] (LUCENE-7997) More sanity testing of similarities

2017-10-23 Thread Robert Muir (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-7997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-7997:

Attachment: LUCENE-7997_wip.patch

Updated patch that also tests floating point tf values. We assume 
computeSlopFactor has the range {{(0 .. 1]}} for testing. This found a leftover 
buggy float cast in DFR {{I(F)}} but also a new bug: Axiomatic model F1 will 
most likely return NaN values if you use SloppyPhraseQuery! Frequency values < 
1 cause its first log to go negative, and once freq drops below 1/e (about 
0.37) the next log goes NaN: the formula is {{1 + log(1 + log(freq))}}. Imagine 
freq=0.3, this is {{1 + log(1 + -1.2)}} = {{1 + log(-0.2)}} = NaN. If we alter 
the formula to use {{log(1 + freq)}} then tests pass, but this needs 
investigation and may not be an appropriate solution, so I marked it AwaitsFix 
for now.
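To make the arithmetic above concrete, here is a small standalone sketch. It is 
not the Lucene code: the class and method names are made up for illustration, 
and {{altered}} is only one reading of "use log(1 + freq)" (the inner log is 
replaced), which is an assumption.

```java
// Standalone illustration only; names are hypothetical and not part of Lucene.
final class AxiomaticTfSketch {

  /** tf component as quoted in the message: 1 + log(1 + log(freq)). */
  static double original(double freq) {
    return 1 + Math.log(1 + Math.log(freq));
  }

  /** Assumed reading of the alteration: the inner log becomes log(1 + freq). */
  static double altered(double freq) {
    return 1 + Math.log(1 + Math.log(1 + freq));
  }

  public static void main(String[] args) {
    // Sloppy phrase matches can yield fractional frequencies in (0 .. 1].
    for (double freq : new double[] {0.3, 0.5, 1.0, 2.0}) {
      System.out.println(
          "freq=" + freq + " original=" + original(freq) + " altered=" + altered(freq));
    }
  }
}
```

With freq=0.3 the original form evaluates the outer log on a negative number, 
so the result is NaN, while the altered form stays finite for any freq >= 0.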


[jira] [Updated] (LUCENE-7997) More sanity testing of similarities

2017-10-23 Thread Robert Muir (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-7997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-7997:

Attachment: LUCENE-7997_wip.patch

Updated to test all sims and parameters.


[jira] [Updated] (LUCENE-7997) More sanity testing of similarities

2017-10-23 Thread Robert Muir (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-7997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-7997:

Attachment: LUCENE-7997_wip.patch

Patch randomizing values of parameters and adding missing range checks/docs for 
these parameters. These are just the valid ranges documented by the formulas. 
For unbounded parameters (such as normalization c and smoothing parameter mu) 
we treat them the same as BM25's k1: the range check just ensures they are 
non-negative and finite, and we test the range 0..Integer.MAX_VALUE.
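As a rough sketch of the kind of range check described above (not the code 
from the patch; the helper names and messages are hypothetical), an unbounded 
parameter only needs to be non-negative and finite, while a parameter with a 
documented range, such as BM25's b in 0..1, gets explicit bounds:

```java
// Hypothetical helpers for illustration; not taken from the patch or from Lucene.
final class ParameterRangeCheckSketch {

  /** For unbounded parameters (e.g. k1, c, mu): reject NaN, infinite and negative values. */
  static float checkNonNegativeFinite(String name, float value) {
    if (Float.isNaN(value) || Float.isInfinite(value) || value < 0) {
      throw new IllegalArgumentException(
          "illegal " + name + " value: " + value + ", must be a non-negative finite value");
    }
    return value;
  }

  /** For parameters whose formula documents a closed range [min, max]. */
  static float checkRange(String name, float value, float min, float max) {
    if (Float.isNaN(value) || value < min || value > max) {
      throw new IllegalArgumentException(
          "illegal " + name + " value: " + value + ", must be between " + min + " and " + max);
    }
    return value;
  }

  public static void main(String[] args) {
    System.out.println(checkNonNegativeFinite("mu", 2000f));
    System.out.println(checkRange("b", 0.75f, 0f, 1f));
    try {
      checkNonNegativeFinite("k1", Float.NaN);
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

Checking NaN explicitly matters because every comparison with NaN is false, so 
a plain {{value < 0}} test would let it through.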

Still TODO: the axiomatic parameters. I need to look at the paper and the 
existing code (it has some range checks already, so it may be easy).


[jira] [Updated] (LUCENE-7997) More sanity testing of similarities

2017-10-23 Thread Robert Muir (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-7997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-7997:

Attachment: LUCENE-7997_wip.patch

Updated patch with all remaining sims (axiomatic and language models) now 
tested.
The axiomatic F3EXP and F3LOG fail because their gamma function drives scores 
negative; I added a warning to their javadocs about this. Also note that these 
two models don't have default parameter-free ctors. The other 4 models (F1EXP, 
F1LOG, F2EXP, F2LOG) are all fine, since they don't have this gamma function.

At least now we have the lay of the land, and it is as expected. 

We still need to deal with many parameters which aren't yet tested. In many 
cases these are also missing range checks, so we need to dig up/figure out the 
valid domain, randomize them, look for issues etc. But the default values are 
tested.



[jira] [Updated] (LUCENE-7997) More sanity testing of similarities

2017-10-23 Thread Robert Muir (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-7997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-7997:

Attachment: LUCENE-7997_wip.patch

Updated with the information-based models. LL passes the test, and SPL fails as 
expected; it has warnings in its javadocs.


[jira] [Updated] (LUCENE-7997) More sanity testing of similarities

2017-10-23 Thread Robert Muir (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-7997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-7997:

Attachment: LUCENE-7997_wip.patch

Patch with the 3 DFI models passing too.


[jira] [Updated] (LUCENE-7997) More sanity testing of similarities

2017-10-23 Thread Robert Muir (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-7997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-7997:

Attachment: LUCENE-7997_wip.patch

Updated patch with DFR passing/failing the new tests as expected:
* scoring models without warnings in the javadocs pass: models {{G}}, {{I(F)}}, 
{{I(n)}}, {{I(ne)}} 
* ones with warnings in javadocs all fail: models {{BE}}, {{D}}, and {{P}}

I think this is a good sign it works to do what we need. To make DFR pass at 
all, I changed SimilarityBase to use {{double}} everywhere internally, then 
cast to 32-bit float at the end. This fixed all the numerical errors. I think 
this makes sense as this subclass is supposed to be simple and easy to use 
(separately, we should take another look at the whole thing now that a lot of 
ClassicSimilarity's complexity has been removed). It makes the formulas more 
elegant in many cases too because constants like {{5.0}} are naturally doubles 
and all java Math functions take doubles, so some casts etc get removed.
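The "double everywhere, cast to float at the end" idea can be sketched in 
isolation. This is not SimilarityBase code: it is a hypothetical standalone 
class that borrows a BM25-style tf formula and the parameter values from the 
failing BM25 run quoted elsewhere in this thread, purely to compare the two 
styles of arithmetic.

```java
// Illustration only; class and method names are hypothetical.
final class DoubleThenCastSketch {

  /** Every intermediate result is rounded to 32-bit float. */
  static float tfFloat(float freq, float k1, float b, float dl, float avgdl) {
    return freq / (freq + k1 * (1 - b + b * dl / avgdl));
  }

  /** Every intermediate result is kept in 64-bit double. */
  static double tfDouble(double freq, double k1, double b, double dl, double avgdl) {
    return freq / (freq + k1 * (1 - b + b * dl / avgdl));
  }

  /** Double internally, a single rounding to float at the very end. */
  static float tfDoubleThenCast(double freq, double k1, double b, double dl, double avgdl) {
    return (float) tfDouble(freq, k1, b, dl, avgdl);
  }

  public static void main(String[] args) {
    for (int freq = 113658; freq <= 113659; freq++) {
      System.out.println("freq=" + freq
          + " float=" + tfFloat(freq, 1.2f, 0.75f, 1048600f, 2300.5593f)
          + " doubleThenCast=" + tfDoubleThenCast(freq, 1.2, 0.75, 1048600.0, 2300.5593));
    }
  }
}
```

The narrowing cast from double to float is monotonic, so if the double result 
does not decrease as freq grows, the float result cannot decrease either. The 
all-float variant rounds at every step and carries no such guarantee.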

Will work through the other models and look at potential improvements to 
explain etc. here too for consistency.


[jira] [Updated] (LUCENE-7997) More sanity testing of similarities

2017-10-20 Thread Robert Muir (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-7997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-7997:

Attachment: LUCENE-7997_wip.patch

Added ClassicSimilarity and BooleanSimilarity to testing, and randomized bm25 
parameters and boosts.
ClassicSimilarity was fine; it just needed explain() cleaned up to exactly 
match score().

Note that query boosts and bm25's k1 parameter are only tested within a 
"reasonable" range (0..Integer.MAX_VALUE) so we can fail the test if the sim 
has unexpected internal overflows. This is just trying to kick out the sim bugs.


[jira] [Updated] (LUCENE-7997) More sanity testing of similarities

2017-10-20 Thread Robert Muir (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-7997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-7997:

Attachment: LUCENE-7997_wip.patch

Updated patch with more cleanups around explain. I tried to add descriptions 
for parts of the formula and also use standard nomenclature. I think it's 
better now; here is typical output:

{noformat}
20.629753 = score(doc=0,freq=979.0), product of:
  2.2 = scaling factor, k1 + 1
  9.388654 = idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:
    1.0 = n, number of documents containing term
    17927.0 = N, total number of documents with field
  0.9987758 = tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:
    979.0 = freq, occurrences of term within document
    1.2 = k1, term saturation parameter
    0.75 = b, length normalization parameter
    1.0 = dl, length of field
    1.0 = avgdl, average length of field
{noformat}
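The numbers in that explanation can be re-derived by hand. The following is a 
standalone sketch, not the BM25Similarity source: the class and method names 
are made up, and it works in double, so the last digits may differ slightly 
from the float values Lucene prints.

```java
// Recomputes the explain output above from its leaf values. Names are hypothetical.
final class Bm25ExplainSketch {

  /** idf = log(1 + (N - n + 0.5) / (n + 0.5)). */
  static double idf(double docFreq, double docCount) {
    return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
  }

  /** tf = freq / (freq + k1 * (1 - b + b * dl / avgdl)). */
  static double tf(double freq, double k1, double b, double dl, double avgdl) {
    return freq / (freq + k1 * (1 - b + b * dl / avgdl));
  }

  /** score = scaling * idf * tf, where scaling = k1 + 1. */
  static double score(double docFreq, double docCount, double freq,
                      double k1, double b, double dl, double avgdl) {
    return (k1 + 1) * idf(docFreq, docCount) * tf(freq, k1, b, dl, avgdl);
  }

  public static void main(String[] args) {
    System.out.println("idf   = " + idf(1, 17927));
    System.out.println("tf    = " + tf(979, 1.2, 0.75, 1, 1));
    System.out.println("score = " + score(1, 17927, 979, 1.2, 0.75, 1, 1));
  }
}
```

Because the scaling factor {{k1 + 1}} is pulled out, tf on its own always stays 
at or below 1, which is what makes saturation easy to read off.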

You can more easily see term frequency saturation including extreme cases such 
as 1.0 where no more occurrences can help. You can kinda visualize how it can 
work for maxScore now :)

{noformat}
...
  1.0 = tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:
    5.9470048E8 = freq, occurrences of term within document
    1.2 = k1, term saturation parameter
    0.75 = b, length normalization parameter
    40.0 = dl, length of field (approximate)
    3.72180768E8 = avgdl, average length of field
...
{noformat}
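A quick way to see why tf prints as exactly 1.0 here: in double the value is 
still slightly below 1, but the gap is far smaller than float resolution near 
1, so the final 32-bit result rounds up to 1.0. This is a standalone sketch 
with hypothetical names, fed with the values from the output above.

```java
// Illustration of tf saturation under float rounding; not Lucene code.
final class TfSaturationSketch {

  /** tf = freq / (freq + k1 * (1 - b + b * dl / avgdl)), computed in double. */
  static double tf(double freq, double k1, double b, double dl, double avgdl) {
    return freq / (freq + k1 * (1 - b + b * dl / avgdl));
  }

  public static void main(String[] args) {
    double exact = tf(5.9470048E8, 1.2, 0.75, 40.0, 3.72180768E8);
    System.out.println("double tf = " + exact);
    System.out.println("float tf  = " + (float) exact);
  }
}
```

Once the float value reaches 1.0, additional occurrences of the term cannot 
raise the score any further.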



[jira] [Updated] (LUCENE-7997) More sanity testing of similarities

2017-10-20 Thread Robert Muir (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-7997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-7997:

Attachment: LUCENE-7997_wip.patch

Updated patch, also enforcing that explain == score (exactly, no floating point 
differences). 

I cleaned up the BM25 explain to be transparent and reflect how the calculation 
is done.
Most importantly, the explanation is now broken out as {{scaling * idf * tf}}, 
the way we compute it and as described in 
http://kak.tx0.org/Information-Retrieval/TFxIDF, rather than displaying the 
"re-arranged formula" with tf including the {{k1 + 1}} scaling factor. Maybe 
it's an improvement for debugging too, since it pulls out the independent 
scaling factor, making it easier to see the specifics of term frequency 
saturation and IDF across docs/terms?


[jira] [Updated] (LUCENE-7997) More sanity testing of similarities

2017-10-19 Thread Robert Muir (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-7997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-7997:

Attachment: LUCENE-7997_wip.patch

I updated the patch with a possible fix for the monotonic issue, at least so 
that tests pass for now and we can add other checks (like trying to fix 
explain) and understand all the issues.


[jira] [Updated] (LUCENE-7997) More sanity testing of similarities

2017-10-19 Thread Robert Muir (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-7997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-7997:

Attachment: LUCENE-7997_wip.patch

Hacky patch with my current state. I spent a lot of time looking at a 
reasonable state space, which is really hard since we have no limit on the 
number of documents, no bounds on boosts, etc. I tried really hard (maybe too 
hard) to be super-fair to the similarity, e.g. the test shouldn't generate 
scenarios that are impossible to create with IndexWriter. But some things (like 
huge tf values with tiny norm values) are fair game because we don't limit 
stacking terms/synonyms and so forth. This stuff may still have interesting 
test bugs if beasted enough.

Currently the test fails: it seems our bm25 may "go backwards" for largish term 
freqs, which looks like floating point issues to me. I haven't tried to debug 
that yet; other crabs to chase down first.

I don't think we can really debug anything about this test until we first force 
explain() to *exactly* match score() for a sim. I realize this is a PITA, but I 
think we need that and will look into it next.

Here is an example of test output for the "going backwards" case, where it 
fails the pairwise test but the explanation doesn't show it. I still need to 
improve this so that it's really easy to write a one-line test method for any 
scenario, and so on.

{noformat}
[junit4:pickseed] Seed property 'tests.seed' already defined: CA6EF971C3E23AAF
   [junit4]  says ciao! Master seed: CA6EF971C3E23AAF
   [junit4] Executing 1 suite with 1 JVM.
   [junit4] 
   [junit4] Started J0 PID(16127@localhost).
   [junit4] Suite: org.apache.lucene.search.similarities.TestBM25Similarity
   [junit4]   1> 0.03627357 = score(doc=0,freq=113659.0 = prevFreq=113658
   [junit4]   1> ), product of:
   [junit4]   1>   0.016547536 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:
   [junit4]   1>     449.0 = docFreq
   [junit4]   1>     456.0 = docCount
   [junit4]   1>   2.1920826 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:
   [junit4]   1>     113659.0 = prevFreq=113658
   [junit4]   1>     1.2 = parameter k1
   [junit4]   1>     0.75 = parameter b
   [junit4]   1>     2300.5593 = avgFieldLength
   [junit4]   1>     1048600.0 = fieldLength
   [junit4]   1> 
   [junit4]   1> 0.03627357 = score(doc=0,freq=113659.0 = currentFreq=113659
   [junit4]   1> ), product of:
   [junit4]   1>   0.016547536 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:
   [junit4]   1>     449.0 = docFreq
   [junit4]   1>     456.0 = docCount
   [junit4]   1>   2.1920826 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:
   [junit4]   1>     113659.0 = currentFreq=113659
   [junit4]   1>     1.2 = parameter k1
   [junit4]   1>     0.75 = parameter b
   [junit4]   1>     2300.5593 = avgFieldLength
   [junit4]   1>     1048600.0 = fieldLength
   [junit4]   1> 
   [junit4]   1> BM25(k1=1.2,b=0.75)
   [junit4]   1> field="field",maxDoc=13938,docCount=456,sumTotalTermFreq=1049055,sumDocFreq=456
   [junit4]   1> term="term",docFreq=449,totalTermFreq=196765
   [junit4]   1> norm=168 (doc length ~ 1048600)
   [junit4]   1> freq=113659
   [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestBM25Similarity -Dtests.method=testRandomScoring -Dtests.seed=CA6EF971C3E23AAF -Dtests.locale=el -Dtests.timezone=Etc/GMT-13 -Dtests.asserts=true -Dtests.file.encoding=UTF-8
   [junit4] FAILURE 0.12s | TestBM25Similarity.testRandomScoring <<<
   [junit4]    > Throwable #1: java.lang.AssertionError: score(113658)=0.036273565 > score(113659)=0.03627356
   [junit4]    > at __randomizedtesting.SeedInfo.seed([CA6EF971C3E23AAF:41F1A0C3D995DCA5]:0)
   [junit4]    > at org.apache.lucene.search.similarities.BaseSimilarityTestCase.doTestScoring(BaseSimilarityTestCase.java:324)
   [junit4]    > at org.apache.lucene.search.similarities.BaseSimilarityTestCase.testRandomScoring(BaseSimilarityTestCase.java:296)
   [junit4]    > at java.lang.Thread.run(Thread.java:745)
   [junit4]   2> NOTE: test params are: codec=CheapBastard, sim=RandomSimilarity(queryNorm=true): {field=DFR I(ne)3(800.0)}, locale=el, timezone=Etc/GMT-13
   [junit4]   2> NOTE: Linux 4.4.0-92-generic amd64/Oracle Corporation 1.8.0_45 (64-bit)/cpus=8,threads=1,free=134724456,total=189267968
   [junit4]   2> NOTE: All tests run in this JVM: [TestBM25Similarity]
   [junit4] Completed [1/1 (1!)] in 1.14s, 1 test, 1 failure <<< FAILURES!
{noformat}
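The pairwise check that fails above (score(freq - 1) must not exceed 
score(freq)) can be sketched generically. This is not the BaseSimilarityTestCase 
code: the class, the method name and the scorer interface used here are 
assumptions chosen to keep the example self-contained.

```java
// Hypothetical standalone version of a "score must not go backwards" check.
final class MonotonicFreqCheckSketch {

  /**
   * Walks freq from minFreq to maxFreq and compares each score with the previous one.
   * Returns the first freq whose score is lower than the score at freq - 1,
   * or -1 if the scorer never goes backwards in that range.
   */
  static long firstDecrease(
      java.util.function.DoubleUnaryOperator scorer, long minFreq, long maxFreq) {
    double prev = scorer.applyAsDouble(minFreq);
    for (long freq = minFreq + 1; freq <= maxFreq; freq++) {
      double cur = scorer.applyAsDouble(freq);
      if (cur < prev) {
        return freq;
      }
      prev = cur;
    }
    return -1;
  }

  public static void main(String[] args) {
    // A saturating tf-like function should never decrease as freq grows.
    System.out.println(firstDecrease(f -> f / (f + 1.2), 1, 1000));
    // A deliberately broken scorer is caught at the first step.
    System.out.println(firstDecrease(f -> -f, 1, 10));
  }
}
```

Equal scores for neighbouring frequencies are accepted here; only a strict 
decrease counts as going backwards, matching the assertion message in the log.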
