Hello,

I’m Fiona with Basis Technology. We’re investigating what we believe to be
a bug involving inconsistent query results. By bisecting, we traced the
issue to the flattening of nested disjunctions that was introduced with the
merge of LUCENE-7386
<https://issues.apache.org/jira/browse/LUCENE-7386>. To reproduce the
issue, I have attached a Lucene index built with Lucene 8.1.0
(names_index.tar.gz) and a Java class (LuceneSearchIndex.java). If you run
the class multiple times against Lucene 8.0.0, max_score is the same on
every run. If you run it against Lucene 8.1.0, max_score varies between
runs: up to 10 runs should be enough to see it return 1.8651859 on some
runs and 2.1415303 on others.

From debugging in Lucene 8.1.0, the query against the name index, before
its nested disjunctions are flattened, looks like this:

(((bt_rni_name_encoded_1:ALFR)^0.75 bt_rni_name_encoded_1:ALTR
(bt_rni_name_encoded_1:ANTR)^0.75
(bt_rni_name_encoded_1:LTR)^0.6666666)
((bt_rni_name_encoded_1:ALTR)^0.75 (bt_rni_name_encoded_1:FLTMR)^0.75
(bt_rni_name_encoded_1:FLTRN)^0.75 (bt_rni_name_encoded_1:FLTS)^0.75
(bt_rni_name_encoded_1:FTR)^0.6666666
(bt_rni_name_encoded_1:LTR)^0.6666666)) |
(((bt_rni_name_encoded_2:FLTR)^0.75) (bt_rni_name_encoded_2:FLTR
(bt_rni_name_encoded_2:FLTRN)^0.75))


The term that causes the difference in the final score is
bt_rni_name_encoded_1:ALTR. As the query above shows, it occurs twice,
nested under different clauses: in the first clause its docFreq is 3, and
in the second clause its docFreq is 2. This happens in Lucene 8.0.0 as
well. *Is a term being read with different docFreq values expected
behaviour?*
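For context on why the docFreq matters: it feeds the idf part of the score.
Below is a minimal sketch, assuming the default BM25Similarity is in use
(the attached class may configure something else) and using a made-up
docCount of 1000. The class and method names are hypothetical; the formula
mirrors BM25Similarity's idf,
ln(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)).

```java
public class IdfSketch {
    // Mirrors the idf formula of Lucene's BM25Similarity (assumed here to be
    // the similarity in use):
    // ln(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)).
    public static float idf(long docFreq, long docCount) {
        return (float) Math.log(1 + (docCount - docFreq + 0.5D) / (docFreq + 0.5D));
    }

    public static void main(String[] args) {
        long docCount = 1000; // hypothetical; the real index size is not stated here
        System.out.println("idf(docFreq=2) = " + idf(2, docCount));
        System.out.println("idf(docFreq=3) = " + idf(3, docCount));
    }
}
```

A lower docFreq gives a higher idf, so the same term contributes a different
amount to the score depending on which docFreq is used.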

After flattening the nested disjunctions (part of the query rewrite
process), the query looks like this:

((bt_rni_name_encoded_1:FTR)^0.6666666
(bt_rni_name_encoded_1:FLTRN)^0.75 (bt_rni_name_encoded_1:FLTMR)^0.75
(bt_rni_name_encoded_1:ALFR)^0.75 (bt_rni_name_encoded_1:FLTS)^0.75
(bt_rni_name_encoded_1:ANTR)^0.75
(bt_rni_name_encoded_1:LTR)^1.3333333
(bt_rni_name_encoded_1:ALTR)^1.75) |
((bt_rni_name_encoded_2:FLTRN)^0.75 (bt_rni_name_encoded_2:FLTR)^1.75)


As you can see, bt_rni_name_encoded_1:ALTR now appears only once, with the
boosts from the original query summed (1 + 0.75 = 1.75). This is the
version of the query that actually gets executed. Here the docFreq for the
bt_rni_name_encoded_1:ALTR term is sometimes 3 and sometimes 2 between
runs, and the final score changes accordingly. *Is this "coin toss" pick of
docFreq for the same term expected behaviour?*
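To illustrate how the boosts end up summed, here is a minimal sketch in
plain Java. It is not Lucene's actual rewrite code: the class and method
names are hypothetical, and each nested group of the bt_rni_name_encoded_1
disjunct is modelled as a simple term-to-boost map.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FlattenBoostsSketch {
    // Merges nested groups of boosted terms into one flat group, summing the
    // boosts of terms that occur in more than one group.
    public static Map<String, Float> flatten(List<Map<String, Float>> groups) {
        Map<String, Float> merged = new LinkedHashMap<>();
        for (Map<String, Float> group : groups) {
            for (Map.Entry<String, Float> clause : group.entrySet()) {
                merged.merge(clause.getKey(), clause.getValue(), Float::sum);
            }
        }
        return merged;
    }

    // The two nested groups on bt_rni_name_encoded_1 from the unflattened query.
    public static List<Map<String, Float>> exampleGroups() {
        Map<String, Float> first = new LinkedHashMap<>();
        first.put("ALFR", 0.75f);
        first.put("ALTR", 1.0f); // no explicit boost means 1.0
        first.put("ANTR", 0.75f);
        first.put("LTR", 0.6666666f);
        Map<String, Float> second = new LinkedHashMap<>();
        second.put("ALTR", 0.75f);
        second.put("FLTMR", 0.75f);
        second.put("FLTRN", 0.75f);
        second.put("FLTS", 0.75f);
        second.put("FTR", 0.6666666f);
        second.put("LTR", 0.6666666f);
        return List.of(first, second);
    }

    public static void main(String[] args) {
        Map<String, Float> merged = flatten(exampleGroups());
        System.out.println("ALTR boost: " + merged.get("ALTR")); // 1.0 + 0.75
        System.out.println("LTR boost: " + merged.get("LTR"));   // 0.6666666 + 0.6666666
    }
}
```

The sketch only tracks boosts. It says nothing about which docFreq the
merged term ends up with, which is the open question here.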

It looks like the issue stems from one of the two behaviours highlighted in
bold above.

Looking forward to hearing back from you.

Attachment: names_index.tar.gz
Description: GNU Zip compressed data

Attachment: LuceneSearchIndex.java
Description: Binary data
