[ https://issues.apache.org/jira/browse/LUCENE-6687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marko Bonaci updated LUCENE-6687:
---------------------------------
    Description: 
In {{org.apache.lucene.queries.mlt.MoreLikeThis}}, the method 
{{retrieveTerms}} receives a {{Map}} of fields, i.e. essentially a document, 
although it doesn't have to be an existing doc.

!solr-mlt-tf-doubling-bug.png|height=500!

There are two {{for}} loops, one nested inside the other, and both iterate over 
the same set of fields.
With the two fields used here, that effectively doubles the term frequency of 
every term from the fields that we provide in the MLT QP {{qf}} parameter. 
The method goes over the list of fields twice and, on each pass, accumulates 
the term frequencies from all fields into {{termFreqMap}}.
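
To make the effect concrete, here is a minimal, self-contained sketch of the loop shape described above. It is not the Lucene source: the class {{TfDoublingSketch}}, the whitespace tokenizer, the {{nestedLoops}} and {{singleLoop}} helpers, and the sample document are all made up for illustration. Only the names {{fieldNames}}, {{fields}} and {{termFreqMap}} are taken from the report.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TfDoublingSketch {

    // Hypothetical stand-in for term extraction: counts whitespace-separated tokens.
    public static void addTermFrequencies(String text, Map<String, Integer> termFreqMap) {
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                termFreqMap.merge(token, 1, Integer::sum);
            }
        }
    }

    // Loop shape described in the report: the outer loop walks fieldNames and the
    // inner loop walks the keys of the same document, so every field value is
    // counted once per entry in fieldNames.
    public static Map<String, Integer> nestedLoops(List<String> fieldNames,
                                                   Map<String, List<String>> fields) {
        Map<String, Integer> termFreqMap = new HashMap<>();
        for (String fieldName : fieldNames) {
            for (String field : fields.keySet()) {
                for (String value : fields.get(field)) {
                    addTermFrequencies(value, termFreqMap);
                }
            }
        }
        return termFreqMap;
    }

    // A single pass over the fields, for comparison.
    public static Map<String, Integer> singleLoop(Map<String, List<String>> fields) {
        Map<String, Integer> termFreqMap = new HashMap<>();
        for (List<String> values : fields.values()) {
            for (String value : values) {
                addTermFrequencies(value, termFreqMap);
            }
        }
        return termFreqMap;
    }

    public static void main(String[] args) {
        // Made-up two-field document; these are not the contents of TID0009.
        Map<String, List<String>> doc = new HashMap<>();
        doc.put("title_mlt", Arrays.asList("accumulator glass"));
        doc.put("pagetext_mlt", Arrays.asList("accumulator accumulator angry"));
        List<String> fieldNames = new ArrayList<>(doc.keySet());

        System.out.println("single pass:  " + singleLoop(doc).get("accumulator"));
        System.out.println("nested loops: " + nestedLoops(fieldNames, doc).get("accumulator"));
    }
}
```

For this sample document the single pass counts {{accumulator}} 3 times and the nested loops count it 6 times. In this sketch the inflation factor equals the number of entries in {{fieldNames}}, which is why two fields give exactly a doubling.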

The private method {{retrieveTerms}} is called from only one public method, the 
overload of {{like}} that receives a {{Map}}, so the private class member 
{{fieldNames}} is always derived from {{retrieveTerms}}'s argument {{fields}}.

In other words, by the time {{retrieveTerms}} gets called, its parameter 
{{fields}} and the private member {{fieldNames}} always contain the same list 
of fields.

Here's the proof:
These are the final results of the calculation:

!solr-mlt-tf-doubling-bug-results.png|height=700!

And this is the actual {{thread_id:TID0009}} document from which those values 
were derived (fields {{title_mlt}} and {{pagetext_mlt}}):

!terms-glass.png|height=100!

!terms-angry.png|height=100!

!terms-how.png|height=100!

!terms-accumulator.png|height=100!

Now, let's further test this hypothesis by seeing MLT QP in action from the 
AdminUI.
Let's try to find docs that are More Like doc {{TID0009}}. 
Here's the interesting part, the query:

{code}
q={!mlt qf=pagetext_mlt,title_mlt mintf=14 mindf=2 minwl=3 maxwl=15}TID0009
{code}

We just saw, in the last image above, that the term {{accumulator}} appears 
{{7}} times in the {{TID0009}} doc, but its TF was calculated as {{14}}.
By using {{mintf=14}}, we say that, when calculating similarity, we don't want 
to consider terms that appear fewer than 14 times (when terms from fields 
{{title_mlt}} and {{pagetext_mlt}} are merged together) in {{TID0009}}.
I added the term {{accumulator}} to only one other document ({{TID0004}}), 
where it appears only once, in the field {{title_mlt}}. 

!solr-mlt-tf-doubling-bug-verify-accumulator-mintf14.png|height=500!

Let's see what happens when we use {{mintf=15}}:

!solr-mlt-tf-doubling-bug-verify-accumulator-mintf15.png|height=500!
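
The arithmetic behind the {{mintf=14}} and {{mintf=15}} queries can be sketched as follows. This is only an illustration under the assumption stated above, namely that a term is kept when its accumulated frequency is not less than {{mintf}}. The class {{MinTfSketch}} and the method {{passesMinTf}} are hypothetical names, not part of Lucene or Solr.

```java
public class MinTfSketch {

    // Assumed filter: a term is considered only if its accumulated term
    // frequency is at least mintf.
    public static boolean passesMinTf(int accumulatedTf, int minTf) {
        return accumulatedTf >= minTf;
    }

    public static void main(String[] args) {
        int actualTf = 7;               // occurrences of "accumulator" in TID0009, per the report
        int inflatedTf = 2 * actualTf;  // 14, the value the report says was calculated

        // With the inflated count the term survives mintf=14 but not mintf=15.
        System.out.println("inflated, mintf=14: " + passesMinTf(inflatedTf, 14));
        System.out.println("inflated, mintf=15: " + passesMinTf(inflatedTf, 15));

        // With the actual count the term would already be dropped at mintf=14.
        System.out.println("actual,   mintf=14: " + passesMinTf(actualTf, 14));
    }
}
```

If the term frequency were not inflated, {{accumulator}} (7 occurrences) could not pass {{mintf=14}} at all, so the threshold at which the term stops being considered is what distinguishes the two cases.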

I should probably mention that multiple fields ({{qf}}) work because I applied 
the patch: [SOLR-7143|https://issues.apache.org/jira/browse/SOLR-7143].

Bug, no?


  was:
In {{org.apache.lucene.queries.mlt.MoreLikeThis}}, there's a method 
{{retrieveTerms}} that receives a {{Map}} of fields, i.e. a document basically, 
but it doesn't have to be an existing doc.

!solr-mlt-tf-doubling-bug.png|height=500!

There are 2 for loops, one inside the other, which both loop through the same 
set of fields.
That effectively doubles the term frequency for all the terms from fields that 
we provide in MLT QP {{qf}} parameter. 
It basically goes two times over the list of fields and accumulates the term 
frequencies from all fields into {{termFreqMap}}.

The private method {{retrieveTerms}} is only called from one public method, the 
version of overloaded method {{like}} that receives a Map: so that private 
class member {{fieldNames}} is always derived from {{retrieveTerms}}'s argument 
{{fields}}.
 
Uh, I don't understand what I wrote myself, but that basically means that, by 
the time {{retrieveTerms}} method gets called, its parameter fields and private 
member {{fieldNames}} always contain the same list of fields.

Here's the proof:
These are the final results of the calculation:

!solr-mlt-tf-doubling-bug-results.png|height=700!

And this is the actual {{thread_id:TID0009}} document, where those values were 
derived from (from fields {{title_mlt}} and {{pagetext_mlt}}):

!terms-glass.png|height=100!

!terms-angry.png|height=100!

!terms-how.png|height=100!

!terms-accumulator.png|height=100!

Now, let's further test this hypothesis by seeing MLT QP in action from the 
AdminUI.
Let's try to find docs that are More Like doc {{TID0009}}. 
Here's the interesting part, the query:

{code}
q={!mlt qf=pagetext_mlt,title_mlt mintf=14 mindf=2 minwl=3 maxwl=15}TID0009
{code}

We just saw, in the last image above, that the term accumulator appears {{7}} 
times in {{TID0009}} doc, but the {{accumulator}}'s TF was calculated as {{14}}.
By using {{mintf=14}}, we say that, when calculating similarity, we don't want 
to consider terms that appear less than 14 times (when terms from fields 
{{title_mlt}} and {{pagetext_mlt}} are merged together) in {{TID0009}}.
I added the term accumulator in only one other document ({{TID0004}}), where it 
appears only once, in the field {{title_mlt}}. 

!solr-mlt-tf-doubling-bug-verify-accumulator-mintf14.png|height=500!

Let's see what happens when we use {{mintf=15}}:

!solr-mlt-tf-doubling-bug-verify-accumulator-mintf15.png|height=500!

I should probably mention that multiple fields work because I applied the 
patch: [SOLR-7143|https://issues.apache.org/jira/browse/SOLR-7143].

Bug, no?



> MLT term frequency calculation bug
> ----------------------------------
>
>                 Key: LUCENE-6687
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6687
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/query/scoring, core/queryparser
>    Affects Versions: 5.2.1, Trunk
>         Environment: OS X v10.10.4; Solr 5.2.1
>            Reporter: Marko Bonaci
>         Attachments: buggy-method-usage.png, 
> solr-mlt-tf-doubling-bug-results.png, 
> solr-mlt-tf-doubling-bug-verify-accumulator-mintf14.png, 
> solr-mlt-tf-doubling-bug-verify-accumulator-mintf15.png, 
> solr-mlt-tf-doubling-bug.png, terms-accumulator.png, terms-angry.png, 
> terms-glass.png, terms-how.png
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
