Alessandro Benedetti updated LUCENE-6687:
    Attachment: LUCENE-6687.patch

> MLT term frequency calculation bug
> ----------------------------------
>                 Key: LUCENE-6687
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6687
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/query/scoring, core/queryparser
>    Affects Versions: 5.2.1, 6.0
>         Environment: OS X v10.10.4; Solr 5.2.1
>            Reporter: Marko Bonaci
>            Priority: Major
>             Fix For: 5.2.2
>         Attachments: LUCENE-6687.patch, LUCENE-6687.patch, LUCENE-6687.patch, 
> buggy-method-usage.png, solr-mlt-tf-doubling-bug-results.png, 
> solr-mlt-tf-doubling-bug-verify-accumulator-mintf14.png, 
> solr-mlt-tf-doubling-bug-verify-accumulator-mintf15.png, 
> solr-mlt-tf-doubling-bug.png, terms-accumulator.png, terms-angry.png, 
> terms-glass.png, terms-how.png
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
> In {{org.apache.lucene.queries.mlt.MoreLikeThis}}, there's a method 
> {{retrieveTerms}} that receives a {{Map}} of fields, i.e. a document 
> basically, but it doesn't have to be an existing doc.
> !solr-mlt-tf-doubling-bug.png|height=500!
> There are 2 for loops, one nested inside the other, and both iterate over the 
> same set of fields.
> With the two fields used here, that effectively doubles the term frequency 
> for all the terms from the fields that we provide in the MLT QP {{qf}} 
> parameter.
> It goes over the list of fields twice and accumulates the term 
> frequencies from all fields into {{termFreqMap}}.
> The private method {{retrieveTerms}} is only called from one public method, 
> the overload of {{like}} that receives a {{Map}}, so the private class 
> member {{fieldNames}} is always derived from {{retrieveTerms}}'s argument 
> {{fields}}.
> In other words, by the time the {{retrieveTerms}} method gets called, its 
> parameter {{fields}} and the private member {{fieldNames}} always contain 
> the same list of fields.
> Here's the proof:
> These are the final results of the calculation:
> !solr-mlt-tf-doubling-bug-results.png|height=700!
> And this is the actual {{thread_id:TID0009}} document, where those values 
> were derived from (from fields {{title_mlt}} and {{pagetext_mlt}}):
> !terms-glass.png|height=100!
> !terms-angry.png|height=100!
> !terms-how.png|height=100!
> !terms-accumulator.png|height=100!
> Now, let's further test this hypothesis by seeing MLT QP in action from the 
> AdminUI.
> Let's try to find docs that are More Like doc {{TID0009}}. 
> Here's the interesting part, the query:
> {code}
> q={!mlt qf=pagetext_mlt,title_mlt mintf=14 mindf=2 minwl=3 maxwl=15}TID0009
> {code}
> We just saw, in the last image above, that the term {{accumulator}} appears 
> {{7}} times in the {{TID0009}} doc, but its TF was calculated as {{14}}.
> By using {{mintf=14}}, we say that, when calculating similarity, we don't 
> want to consider terms that appear fewer than 14 times (when terms from 
> fields {{title_mlt}} and {{pagetext_mlt}} are merged together) in 
> {{TID0009}}.
> I added the term {{accumulator}} to only one other document ({{TID0004}}), 
> where it appears only once, in the field {{title_mlt}}.
> !solr-mlt-tf-doubling-bug-verify-accumulator-mintf14.png|height=500!
> Let's see what happens when we use {{mintf=15}}:
> !solr-mlt-tf-doubling-bug-verify-accumulator-mintf15.png|height=500!
> I should probably mention that multiple fields ({{qf}}) work because I 
> applied the patch: 
> [SOLR-7143|https://issues.apache.org/jira/browse/SOLR-7143].
> Bug, no?
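
The nested-loop behaviour described in the report can be sketched outside Lucene. The following is a minimal, hypothetical stand-in, not the MoreLikeThis source and not necessarily what the attached patch does: the class name, the method names, the whitespace tokenizer, and the sample document are all assumptions made for illustration. It only shows that looping over the same field set twice counts every field value once per field name, and that a single loop is one possible way to count each value once.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the loops described in the report; not Lucene code.
class MltTfDoublingSketch {

    // Counts whitespace-separated tokens of one field value into termFreqMap.
    static void addTermFrequencies(String value, Map<String, Integer> termFreqMap) {
        for (String token : value.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                termFreqMap.merge(token, 1, Integer::sum);
            }
        }
    }

    // Nested loops over the same field set: every value is counted once per field name.
    static Map<String, Integer> nestedLoops(List<String> fieldNames,
                                            Map<String, Collection<String>> fields) {
        Map<String, Integer> termFreqMap = new HashMap<>();
        for (String fieldName : fieldNames) {
            for (String field : fields.keySet()) {
                for (String value : fields.get(field)) {
                    addTermFrequencies(value, termFreqMap);
                }
            }
        }
        return termFreqMap;
    }

    // Single loop (one possible correction): each configured field is counted once.
    static Map<String, Integer> singleLoop(List<String> fieldNames,
                                           Map<String, Collection<String>> fields) {
        Map<String, Integer> termFreqMap = new HashMap<>();
        for (String fieldName : fieldNames) {
            Collection<String> values = fields.get(fieldName);
            if (values == null) {
                continue;
            }
            for (String value : values) {
                addTermFrequencies(value, termFreqMap);
            }
        }
        return termFreqMap;
    }

    // Made-up document in which "accumulator" occurs 7 times across the two fields.
    static Map<String, Collection<String>> sampleDoc() {
        Map<String, Collection<String>> doc = new LinkedHashMap<>();
        doc.put("title_mlt", Arrays.asList("accumulator glass"));
        doc.put("pagetext_mlt", Arrays.asList(
            "accumulator accumulator accumulator accumulator accumulator accumulator angry"));
        return doc;
    }

    public static void main(String[] args) {
        List<String> fieldNames = Arrays.asList("pagetext_mlt", "title_mlt");
        Map<String, Collection<String>> doc = sampleDoc();
        int nested = nestedLoops(fieldNames, doc).get("accumulator");
        int single = singleLoop(fieldNames, doc).get("accumulator");
        System.out.println("nested loops TF = " + nested + ", single loop TF = " + single);
        // A term is kept when its TF is at least mintf, as described in the report.
        System.out.println("kept at mintf=14 with nested loops: " + (nested >= 14));
        System.out.println("kept at mintf=14 with single loop:  " + (single >= 14));
    }
}
```

With this sample document the nested version reports a TF of 14 for "accumulator" and the single-loop version reports 7, so only the nested version would keep the term at mintf=14. That matches the doubling described in the report for two fields.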

