[
https://issues.apache.org/jira/browse/LUCENE-6687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16801564#comment-16801564
]
ASF subversion and git services commented on LUCENE-6687:
---------------------------------------------------------
Commit 42a548e28efd74e283bfafaf6dabe7ebe01251e5 in lucene-solr's branch
refs/heads/master from Tommaso Teofili
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=42a548e ]
LUCENE-6687 - avoid unnecessary looping
> MLT term frequency calculation bug
> ----------------------------------
>
> Key: LUCENE-6687
> URL: https://issues.apache.org/jira/browse/LUCENE-6687
> Project: Lucene - Core
> Issue Type: Bug
> Components: core/query/scoring, core/queryparser
> Affects Versions: 5.2.1, 6.0
> Environment: OS X v10.10.4; Solr 5.2.1
> Reporter: Marko Bonaci
> Assignee: Tommaso Teofili
> Priority: Major
> Fix For: 5.2.2
>
> Attachments: LUCENE-6687.patch, LUCENE-6687.patch, LUCENE-6687.patch,
> LUCENE-6687.patch, buggy-method-usage.png,
> solr-mlt-tf-doubling-bug-results.png,
> solr-mlt-tf-doubling-bug-verify-accumulator-mintf14.png,
> solr-mlt-tf-doubling-bug-verify-accumulator-mintf15.png,
> solr-mlt-tf-doubling-bug.png, terms-accumulator.png, terms-angry.png,
> terms-glass.png, terms-how.png
>
> Time Spent: 1h 10m
> Remaining Estimate: 0h
>
> In {{org.apache.lucene.queries.mlt.MoreLikeThis}}, the method
> {{retrieveTerms}} receives a {{Map}} of fields, which is essentially a
> document, although it doesn't have to be an existing doc.
> !solr-mlt-tf-doubling-bug.png|height=500!
> There are two {{for}} loops, one nested inside the other, and both iterate
> over the same set of fields.
> That effectively doubles the term frequency of every term from the fields
> that we provide in the MLT QP {{qf}} parameter.
> The code goes over the list of fields twice and accumulates the term
> frequencies from all fields into {{termFreqMap}} on each pass.
> The private method {{retrieveTerms}} is called from only one public method,
> the overload of {{like}} that receives a {{Map}}, so the private class
> member {{fieldNames}} is always derived from {{retrieveTerms}}'s argument
> {{fields}}.
>
> In other words, by the time {{retrieveTerms}} gets called, its parameter
> {{fields}} and the private member {{fieldNames}} always contain the same
> list of fields.
> Here's the proof:
> These are the final results of the calculation:
> !solr-mlt-tf-doubling-bug-results.png|height=700!
> And this is the actual {{thread_id:TID0009}} document, where those values
> were derived from (from fields {{title_mlt}} and {{pagetext_mlt}}):
> !terms-glass.png|height=100!
> !terms-angry.png|height=100!
> !terms-how.png|height=100!
> !terms-accumulator.png|height=100!
> Now, let's further test this hypothesis by seeing MLT QP in action from the
> AdminUI.
> Let's try to find docs that are More Like doc {{TID0009}}.
> Here's the interesting part, the query:
> {code}
> q={!mlt qf=pagetext_mlt,title_mlt mintf=14 mindf=2 minwl=3 maxwl=15}TID0009
> {code}
> We just saw, in the last image above, that the term {{accumulator}} appears
> {{7}} times in doc {{TID0009}}, but its TF was calculated as {{14}}.
> By using {{mintf=14}}, we say that, when calculating similarity, we don't
> want to consider terms that appear fewer than 14 times (when terms from
> fields {{title_mlt}} and {{pagetext_mlt}} are merged together) in
> {{TID0009}}.
> I added the term {{accumulator}} to only one other document ({{TID0004}}),
> where it appears only once, in the field {{title_mlt}}.
> !solr-mlt-tf-doubling-bug-verify-accumulator-mintf14.png|height=500!
> Let's see what happens when we use {{mintf=15}}:
> !solr-mlt-tf-doubling-bug-verify-accumulator-mintf15.png|height=500!
> I should probably mention that multiple fields ({{qf}}) work because I
> applied the patch:
> [SOLR-7143|https://issues.apache.org/jira/browse/SOLR-7143].
> Bug, no?
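
The doubling described in the quoted report can be reproduced with a small standalone sketch. This is not the Lucene source: the class name MltTfDoublingSketch, its helper methods, the whitespace tokenizer, and the sample field contents are all hypothetical, chosen only so that "accumulator" occurs 7 times across title_mlt and pagetext_mlt as in the report. It assumes the buggy shape is an outer loop over the configured field names with an inner loop over the document's fields, and no check that the two refer to the same field.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names come from Lucene.
class MltTfDoublingSketch {

    // Made-up document: "accumulator" occurs 1 + 6 = 7 times in total.
    static Map<String, String> sampleDoc() {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("title_mlt", "accumulator glass");
        doc.put("pagetext_mlt",
                "accumulator accumulator accumulator accumulator accumulator accumulator how");
        return doc;
    }

    // Counts whitespace-separated tokens of text into termFreqMap.
    static void addTermFrequencies(String text, Map<String, Integer> termFreqMap) {
        for (String token : text.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                termFreqMap.merge(token, 1, Integer::sum);
            }
        }
    }

    // Assumed buggy shape: every field value is counted once per field name.
    static Map<String, Integer> buggyTermFreqs(List<String> fieldNames, Map<String, String> fields) {
        Map<String, Integer> termFreqMap = new HashMap<>();
        for (String fieldName : fieldNames) {
            for (Map.Entry<String, String> field : fields.entrySet()) {
                addTermFrequencies(field.getValue(), termFreqMap);
            }
        }
        return termFreqMap;
    }

    // Single pass: every field value is counted exactly once.
    static Map<String, Integer> singlePassTermFreqs(Map<String, String> fields) {
        Map<String, Integer> termFreqMap = new HashMap<>();
        for (Map.Entry<String, String> field : fields.entrySet()) {
            addTermFrequencies(field.getValue(), termFreqMap);
        }
        return termFreqMap;
    }

    // A term survives the filter only if its frequency is at least minTf.
    static boolean passesMinTf(Map<String, Integer> termFreqMap, String term, int minTf) {
        return termFreqMap.getOrDefault(term, 0) >= minTf;
    }

    public static void main(String[] args) {
        Map<String, String> doc = sampleDoc();
        List<String> fieldNames = Arrays.asList("title_mlt", "pagetext_mlt");

        Map<String, Integer> buggy = buggyTermFreqs(fieldNames, doc);
        Map<String, Integer> single = singlePassTermFreqs(doc);

        System.out.println("buggy tf  = " + buggy.get("accumulator"));   // 14
        System.out.println("single tf = " + single.get("accumulator"));  // 7
        System.out.println("buggy, mintf=14: " + passesMinTf(buggy, "accumulator", 14)); // true
        System.out.println("buggy, mintf=15: " + passesMinTf(buggy, "accumulator", 15)); // false
    }
}
```

With two field names, the nested loops count each occurrence twice, which matches the reported jump from 7 to 14 and explains why the term still qualifies at mintf=14 but drops out at mintf=15. In this sketch the inflation factor equals the number of field names, so the "doubling" is specific to the two-field case.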