RE: Inconsistent query results in Lucene 8.1.0

Staley, Phil R - DCF Fri, 06 Mar 2020 11:02:07 -0800

We recently upgraded our Drupal 8 sites to Solr 8.3.1.  We are now getting 
reports that certain patterns of search terms result in an error that reads, 
“The website encountered an unexpected error. Please try again later.”




Below is a list of example terms that always result in this error and a similar 
list that works fine.  The problem pattern seems to be a search term that 
contains 2 or 3 characters followed by a space, followed by additional text.



To confirm that the problem is specific to version 8 of Solr, I updated our 
local and UAT sites with the latest Drupal updates, which included an update to 
the Search API Solr module, and tested the terms below under Solr 7.7.2, 8.3.1, 
and 8.4.1.  Under version 7.7.2 everything works fine. Under either version 8 
release, the problem returns.



Thoughts?



Search terms that result in error

• w-2 agency directory

• agency w-2 directory

• w-2 agency

• w-2 directory

• w2 agency directory

• w2 agency

• w2 directory



Search terms that do not result in error

• w-22 agency directory

• agency directory w-2

• agency w-2directory

• agencyw-2 directory

• w-2

• w2

• agency directory

• agency

• directory

• -2 agency directory

• 2 agency directory

• w-2agency directory

• w2agency directory
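
The pattern guess above (a 2- or 3-character token, then a space, then more 
text) can be checked mechanically against both lists. The following is a 
minimal sketch, not part of the original report: the regular expression and the 
class and method names are assumptions made for illustration, and the check 
only tests the guess against the listed terms. It says nothing about what Solr 
does internally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class TermPatternCheck {
    // Assumed encoding of the guess: a whitespace-delimited token of 2 or 3
    // characters, followed by whitespace, followed by at least one more character.
    static final Pattern SHORT_TOKEN_THEN_TEXT =
            Pattern.compile("(?:^|\\s)\\S{2,3}\\s+\\S");

    // Terms copied from the two lists in the report.
    static final List<String> FAILING = List.of(
            "w-2 agency directory", "agency w-2 directory", "w-2 agency",
            "w-2 directory", "w2 agency directory", "w2 agency", "w2 directory");
    static final List<String> WORKING = List.of(
            "w-22 agency directory", "agency directory w-2", "agency w-2directory",
            "agencyw-2 directory", "w-2", "w2", "agency directory", "agency",
            "directory", "-2 agency directory", "2 agency directory",
            "w-2agency directory", "w2agency directory");

    static boolean matchesHypothesis(String query) {
        return SHORT_TOKEN_THEN_TEXT.matcher(query).find();
    }

    // Returns every listed term whose observed behaviour disagrees with the guess.
    static List<String> mismatches() {
        List<String> result = new ArrayList<>();
        for (String q : FAILING) {
            if (!matchesHypothesis(q)) result.add(q);
        }
        for (String q : WORKING) {
            if (matchesHypothesis(q)) result.add(q);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println("Terms the guess gets wrong: " + mismatches());
    }
}
```

Under this encoding the guess fits every listed term except "-2 agency 
directory", which matches the pattern but does not produce the error. One 
possible explanation, which is only a guess, is that the leading "-" is treated 
as query syntax rather than as part of the term.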


From: Michele Palmia <micpal...@gmail.com>
Sent: Friday, March 6, 2020 9:50 AM
To: dev@lucene.apache.org
Subject: Re: Inconsistent query results in Lucene 8.1.0

Hi all,

I looked into this today. I can reproduce it and I believe it's a bug.
This is caused by the following working together:
- LUCENE-7386 <https://issues.apache.org/jira/browse/LUCENE-7386>
  Flatten nested disjunctions
- LUCENE-7925 <https://issues.apache.org/jira/browse/LUCENE-7925>
  Deduplicate SHOULD and MUST clauses in BooleanQuery

Blended term queries modify the df/ttf of their terms to make sure all terms 
produce identical scores. In this case, two blended term queries contain a few 
terms each, only some of which overlap. The two queries calculate different 
df/ttf for their terms respectively, since the two sets are different. During 
the rewrite process,

  1.  the two Blended queries get rewritten as Boolean queries themselves, with 
each (modified) TermQuery as a SHOULD clause
  2.  the nested Boolean queries get flattened, since they are nested 
disjunctions
  3.  the Term queries (some of which are actually Boost queries) are 
deduplicated, with one of the two TermQuery and its modified TermStates being 
picked at random (the randomness is due to the HashSet underlying Lucene's 
MultiSet).
I haven't managed to create a failing test yet; I'll share it when I have one 
ready.
If anybody has suggestions or pointers on how this should be fixed, I'm also 
happy to provide a patch - I'm just a bit clueless what the right thing to do 
would be here: I have a feeling (2.) should not happen for (rewritten) Blended 
Queries?
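
Steps 2 and 3 above can be modelled without Lucene. This is a simplified 
sketch, not Lucene's code: the Clause class and the flattenAndDedup method are 
hypothetical names, and "keep the first clause seen for each term" stands in 
for whichever duplicate the underlying set happens to retain. The only real 
numbers are the two docFreq values for bt_rni_name_encoded_1:ALTR (3 and 2) 
from the original report quoted below; all other terms are left out.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DedupOrderSketch {
    // A term clause together with the docFreq its blended query assigned to it.
    static final class Clause {
        final String term;
        final int docFreq;

        Clause(String term, int docFreq) {
            this.term = term;
            this.docFreq = docFreq;
        }
    }

    // Flattens the nested groups into one level and keeps the first clause
    // seen for each term. Duplicates are detected by term only, so the
    // statistics attached to the later duplicate are silently dropped.
    static Map<String, Integer> flattenAndDedup(List<List<Clause>> groups) {
        Map<String, Integer> kept = new LinkedHashMap<>();
        for (List<Clause> group : groups) {
            for (Clause clause : group) {
                kept.putIfAbsent(clause.term, clause.docFreq);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String altr = "bt_rni_name_encoded_1:ALTR";
        List<Clause> first = List.of(new Clause(altr, 3));   // docFreq in the first blended query
        List<Clause> second = List.of(new Clause(altr, 2));  // docFreq in the second blended query

        // The surviving docFreq depends only on the order the duplicates are visited.
        System.out.println(flattenAndDedup(List.of(first, second)).get(altr)); // 3
        System.out.println(flattenAndDedup(List.of(second, first)).get(altr)); // 2
    }
}
```

In this model the visiting order is chosen explicitly. In the real rewrite it 
would come from the iteration order of the set, which is where the run-to-run 
variation enters.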

Cheers,
Michele


On Tue, Mar 3, 2020 at 7:55 PM Fiona Hasanaj <fi...@basistech.com> wrote:
Hello,

I’m Fiona with Basis Technology. We’re investigating what we believe to be a 
bug involving inconsistent query results. We bisected the issue and found that 
it first appears with the merge of LUCENE-7386 
<https://issues.apache.org/jira/browse/LUCENE-7386>, which introduced 
flattening of nested disjunctions.

To reproduce the issue, I have attached a Lucene index built in Lucene 8.1.0 as 
names_index.tar.gz, along with a Java class (LuceneSearchIndex.java). If you 
run the class multiple times against Lucene 8.0.0, max_score is the same on 
every run. If you run it against Lucene 8.1.0, max_score varies between runs: 
within about 10 runs you should see it return 1.8651859 on some runs and 
2.1415303 on others.

From debugging in Lucene 8.1.0, the query against the name index before 
flattening its nested disjunctions looks like below:



(((bt_rni_name_encoded_1:ALFR)^0.75 bt_rni_name_encoded_1:ALTR 
(bt_rni_name_encoded_1:ANTR)^0.75 (bt_rni_name_encoded_1:LTR)^0.6666666) 
((bt_rni_name_encoded_1:ALTR)^0.75 (bt_rni_name_encoded_1:FLTMR)^0.75 
(bt_rni_name_encoded_1:FLTRN)^0.75 (bt_rni_name_encoded_1:FLTS)^0.75 
(bt_rni_name_encoded_1:FTR)^0.6666666 (bt_rni_name_encoded_1:LTR)^0.6666666)) | 
(((bt_rni_name_encoded_2:FLTR)^0.75) (bt_rni_name_encoded_2:FLTR 
(bt_rni_name_encoded_2:FLTRN)^0.75))

The term that's causing the difference in the final score is 
bt_rni_name_encoded_1:ALTR. As the query above shows, it appears twice, nested 
under different clauses: in the first clause its docFreq is 3, and in the 
second clause its docFreq is 2. This happens in Lucene 8.0.0 as well. Is it 
expected behaviour for the same term to be read with different docFreq values?

After flattening the nested disjunctions (part of query rewrite process), the 
query looks like below:



((bt_rni_name_encoded_1:FTR)^0.6666666 (bt_rni_name_encoded_1:FLTRN)^0.75 
(bt_rni_name_encoded_1:FLTMR)^0.75 (bt_rni_name_encoded_1:ALFR)^0.75 
(bt_rni_name_encoded_1:FLTS)^0.75 (bt_rni_name_encoded_1:ANTR)^0.75 
(bt_rni_name_encoded_1:LTR)^1.3333333 (bt_rni_name_encoded_1:ALTR)^1.75) | 
((bt_rni_name_encoded_2:FLTRN)^0.75 (bt_rni_name_encoded_2:FLTR)^1.75)

As you can see, bt_rni_name_encoded_1:ALTR now appears only once, and its boost 
is the sum of its boosts in the original query. This is the version of the 
query that actually gets used. Between runs, the docFreq for the 
bt_rni_name_encoded_1:ALTR term is sometimes 3 and sometimes 2, and the final 
score changes accordingly. Is this "coin toss" pick of docFreq for the same 
term expected behaviour?
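
The boost summation can be verified by hand from the two queries shown above. 
The sketch below is a plain arithmetic check, not Lucene code: the class and 
method names are made up for illustration, the field prefix 
bt_rni_name_encoded_1 is dropped for brevity, and an unboosted clause is taken 
to have a boost of 1.0.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class BoostMergeSketch {
    // Sums the boosts of clauses that share a term, as the flattened query suggests.
    static Map<String, Float> mergeBoosts(String[] terms, float[] boosts) {
        Map<String, Float> merged = new LinkedHashMap<>();
        for (int i = 0; i < terms.length; i++) {
            merged.merge(terms[i], boosts[i], Float::sum);
        }
        return merged;
    }

    // Field-1 clauses of the query before flattening, in the order they appear.
    static final String[] TERMS = {
        "ALFR", "ALTR", "ANTR", "LTR",
        "ALTR", "FLTMR", "FLTRN", "FLTS", "FTR", "LTR"
    };
    static final float[] BOOSTS = {
        0.75f, 1.0f, 0.75f, 0.6666666f,
        0.75f, 0.75f, 0.75f, 0.75f, 0.6666666f, 0.6666666f
    };

    public static void main(String[] args) {
        // ALTR: 1.0 + 0.75 gives 1.75; LTR: twice 0.6666666 gives about 1.3333333.
        // Terms that occur once keep their original boost.
        mergeBoosts(TERMS, BOOSTS).forEach(
                (term, boost) -> System.out.println(term + " ^" + boost));
    }
}
```

The merged boosts agree with the flattened query for field 1, up to float 
rounding in the last digit.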

Looks like the issue stems from one of the behaviours observed and highlighted 
in bold.

Looking forward to hearing back from you.

