[jira] [Commented] (LUCENE-8011) Improve similarity explanations

ASF GitHub Bot (JIRA) Fri, 01 Dec 2017 00:38:27 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-8011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16274115#comment-16274115
 ]


ASF GitHub Bot commented on LUCENE-8011:
----------------------------------------

Github user jpountz commented on the issue:

    https://github.com/apache/lucene-solr/pull/280
  
    Thanks @mayya-sharipova, this looks like great progress to me. Maybe we 
could go even further and do the following:
     - in the Axiomatic similarity, add abstract methods so that subclasses 
can explain how tf, ln, etc. are computed,
     - make BasicModel.explain abstract to force subclasses to provide their 
own explanation and include the formula,
     - make sure that our own subclasses of SimilarityBase override explain 
(the one that returns an explanation) and include the formula in the 
explanation.
    
    For the record, backward compatibility is not much of a concern here, 
since most of those classes (e.g. Axiomatic, BasicModel) are very expert 
classes and this change targets master.
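
    To illustrate the second and third points, here is a minimal, 
self-contained sketch of the pattern. It does not use Lucene's classes: 
ExplainNode, ModelSketch and LogModelSketch are hypothetical names, and the 
log2(1 + tfn) formula is made up for the example. The only point is that an 
abstract explain method forces every subclass to spell out its own formula.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for an explanation node; NOT Lucene's Explanation class. */
final class ExplainNode {
    final double value;
    final String description;
    final List<ExplainNode> details;

    ExplainNode(double value, String description, ExplainNode... details) {
        this.value = value;
        this.description = description;
        this.details = Arrays.asList(details);
    }

    /** Renders this node and its children as an indented tree. */
    String render(String indent) {
        StringBuilder sb = new StringBuilder();
        sb.append(indent).append(value).append(" = ").append(description).append('\n');
        for (ExplainNode d : details) {
            sb.append(d.render(indent + "  "));
        }
        return sb.toString();
    }
}

/** Hypothetical base class: explain is abstract, so each model must state its formula. */
abstract class ModelSketch {
    abstract double score(double tfn);

    abstract ExplainNode explain(double tfn);
}

/** Illustrative subclass using a made-up formula, log2(1 + tfn). */
final class LogModelSketch extends ModelSketch {
    @Override
    double score(double tfn) {
        return Math.log(1 + tfn) / Math.log(2);
    }

    @Override
    ExplainNode explain(double tfn) {
        // The description carries the formula; the child node names each input.
        return new ExplainNode(
            score(tfn),
            "score(LogModelSketch), computed as log2(1 + tfn) from:",
            new ExplainNode(tfn, "tfn, normalized term frequency"));
    }
}
```

    Calling new LogModelSketch().explain(3.0).render("") yields a two-line 
tree whose first line contains the formula and whose second line names the 
input, which is the shape suggested for the real similarities.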


> Improve similarity explanations
> -------------------------------
>
>                 Key: LUCENE-8011
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8011
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Robert Muir
>              Labels: newdev
>
> LUCENE-7997 improves BM25 and Classic explains to better explain:
> {noformat}
> product of:
>   2.2 = scaling factor, k1 + 1
>   9.388654 = idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:
>     1.0 = n, number of documents containing term
>     17927.0 = N, total number of documents with field
>   0.9987758 = tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:
>     979.0 = freq, occurrences of term within document
>     1.2 = k1, term saturation parameter
>     0.75 = b, length normalization parameter
>     1.0 = dl, length of field
>     1.0 = avgdl, average length of field
> {noformat}
> Previously it was pretty cryptic and used confusing terminology like 
> docCount/docFreq without explanation: 
> {noformat}
> product of:
>   0.016547536 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:
>     449.0 = docFreq
>     456.0 = docCount
>   2.1920826 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:
>     113659.0 = freq=113658
>     1.2 = parameter k1
>     0.75 = parameter b
>     2300.5593 = avgFieldLength
>     1048600.0 = fieldLength
> {noformat}
> We should fix the other similarities in the same way; their explanations 
> should be more practical.
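
One benefit of the improved explanation quoted above is that a reader can 
recompute each factor from the listed inputs. The sketch below does that in 
plain Java, using only the two formulas and the input values shown in the 
first {noformat} block. The class name Bm25ExplainCheck is hypothetical, and 
the arithmetic is done in double precision, so the last digits may differ 
slightly from the values in the explanation.

```java
/** Recomputes the idf and tf factors of the improved BM25 explanation. */
final class Bm25ExplainCheck {

    /** idf = log(1 + (N - n + 0.5) / (n + 0.5)), natural logarithm. */
    static double idf(double n, double N) {
        return Math.log(1 + (N - n + 0.5) / (n + 0.5));
    }

    /** tf = freq / (freq + k1 * (1 - b + b * dl / avgdl)). */
    static double tf(double freq, double k1, double b, double dl, double avgdl) {
        return freq / (freq + k1 * (1 - b + b * dl / avgdl));
    }

    public static void main(String[] args) {
        // Input values copied from the explanation in the issue description.
        double k1 = 1.2;
        double b = 0.75;
        double scaling = k1 + 1;            // "scaling factor, k1 + 1"
        double idf = idf(1.0, 17927.0);     // n = 1, N = 17927
        double tf = tf(979.0, k1, b, 1.0, 1.0);

        System.out.println("scaling = " + scaling);
        System.out.println("idf     = " + idf);
        System.out.println("tf      = " + tf);
        System.out.println("score   = " + scaling * idf * tf);
    }
}
```

The computed idf and tf agree with the 9.388654 and 0.9987758 shown in the 
explanation to about four decimal places, which is the kind of check that 
the older docCount/docFreq output made harder.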



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

