[jira] [Updated] (LUCENE-7997) More sanity testing of similarities

Robert Muir (JIRA) Fri, 20 Oct 2017 18:52:28 -0700

     [ 
https://issues.apache.org/jira/browse/LUCENE-7997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Robert Muir updated LUCENE-7997:
--------------------------------
    Attachment: LUCENE-7997_wip.patch

Updated patch with more cleanups around explain. I tried to add descriptions 
for parts of the formula and also use standard nomenclature. I think it's 
better now; here is typical output:

{noformat}
20.629753 = score(doc=0,freq=979.0), product of:
  2.2 = scaling factor, k1 + 1
  9.388654 = idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:
    1.0 = n, number of documents containing term
    17927.0 = N, total number of documents with field
  0.9987758 = tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:
    979.0 = freq, occurrences of term within document
    1.2 = k1, term saturation parameter
    0.75 = b, length normalization parameter
    1.0 = dl, length of field
    1.0 = avgdl, average length of field
{noformat}
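
As a rough cross-check, the values in the explain output above can be recomputed directly from the formulas it prints. The sketch below is plain Java using doubles; the class and method names are hypothetical and are not part of Lucene or of the patch. It assumes the score is exactly the product (k1 + 1) * idf * tf, as the explain tree suggests.

```java
// Sketch only: recomputes the numbers from the explain output in plain Java.
// Bm25ExplainSketch and its methods are hypothetical names, not Lucene API.
final class Bm25ExplainSketch {

    // idf = log(1 + (N - n + 0.5) / (n + 0.5)), natural logarithm
    static double idf(double n, double N) {
        return Math.log(1 + (N - n + 0.5) / (n + 0.5));
    }

    // tf = freq / (freq + k1 * (1 - b + b * dl / avgdl))
    static double tf(double freq, double k1, double b, double dl, double avgdl) {
        return freq / (freq + k1 * (1 - b + b * dl / avgdl));
    }

    // score = (k1 + 1) * idf * tf, the "product of" line in the explain tree
    static double score(double freq, double n, double N,
                        double k1, double b, double dl, double avgdl) {
        return (k1 + 1) * idf(n, N) * tf(freq, k1, b, dl, avgdl);
    }

    public static void main(String[] args) {
        // Inputs copied from the explain output: n=1, N=17927, freq=979,
        // k1=1.2, b=0.75, dl=1, avgdl=1.
        System.out.println("idf   = " + idf(1, 17927));
        System.out.println("tf    = " + tf(979, 1.2, 0.75, 1, 1));
        System.out.println("score = " + score(979, 1, 17927, 1.2, 0.75, 1, 1));
    }
}
```

The printed values should agree with the explain output (9.388654, 0.9987758, 20.629753) up to rounding, since the explain output appears to use single-precision floats while this sketch uses doubles.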

You can more easily see term frequency saturation, including extreme cases such 
as tf = 1.0, where no more occurrences can help. You can kinda visualize how it 
can work for maxScore now :)

{noformat}
...
  1.0 = tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:
    5.9470048E8 = freq, occurrences of term within document
    1.2 = k1, term saturation parameter
    0.75 = b, length normalization parameter
    40.0 = dl, length of field (approximate)
    3.72180768E8 = avgdl, average length of field
...
{noformat}
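
One plausible reading of why tf prints as exactly 1.0 here: with freq this large, the k1 term in the denominator is smaller than the spacing between adjacent single-precision floats, so it is lost to rounding. The sketch below illustrates this under the assumption that the computation is done in float arithmetic; the class and method names are hypothetical and not taken from the patch.

```java
// Sketch only: shows tf saturating to exactly 1.0 in float arithmetic.
// TfSaturationSketch is a hypothetical name, not Lucene API.
final class TfSaturationSketch {

    // tf = freq / (freq + k1 * (1 - b + b * dl / avgdl)), single precision
    static float tfFloat(float freq, float k1, float b, float dl, float avgdl) {
        return freq / (freq + k1 * (1 - b + b * dl / avgdl));
    }

    // Same formula in double precision, for comparison
    static double tfDouble(double freq, double k1, double b, double dl, double avgdl) {
        return freq / (freq + k1 * (1 - b + b * dl / avgdl));
    }

    public static void main(String[] args) {
        // Inputs copied from the second explain excerpt.
        float f = tfFloat(5.9470048E8f, 1.2f, 0.75f, 40.0f, 3.72180768E8f);
        double d = tfDouble(5.9470048E8, 1.2, 0.75, 40.0, 3.72180768E8);
        System.out.println("float  tf == 1.0f ? " + (f == 1.0f));
        System.out.println("double tf <  1.0  ? " + (d < 1.0));
    }
}
```

In float, adding roughly 0.3 to a value near 5.9e8 leaves it unchanged, so the ratio is exactly 1.0; in double the result stays slightly below 1.0.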


> More sanity testing of similarities
> -----------------------------------
>
>                 Key: LUCENE-7997
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7997
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Adrien Grand
>            Priority: Minor
>         Attachments: LUCENE-7997_wip.patch, LUCENE-7997_wip.patch, 
> LUCENE-7997_wip.patch, LUCENE-7997_wip.patch
>
>
> LUCENE-7993 is a potential optimization that we could only apply if the 
> similarity is an increasing function of {{freq}} (all other things like DF 
> and length being equal). This sounds like a very reasonable requirement for a 
> similarity, so we should test it in the base similarity test case and maybe 
> move broken similarities to sandbox?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

