Re: Inconsistent query results in Lucene 8.1.0

David Smiley Fri, 06 Mar 2020 18:44:01 -0800

Hi Phil,

Please start a new thread (email) for a new problem instead of replying to
an existing one.  The behavior discussed in the existing thread does not
result in an error; yours does, so I think the two are entirely dissimilar.
Also, you'll need to dig deeper to learn what the particular error was and
report that.  Check Solr's logs.


~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Fri, Mar 6, 2020 at 2:01 PM Staley, Phil R - DCF <
phil.sta...@wisconsin.gov> wrote:

> We recently upgraded our Drupal 8 sites to Solr 8.3.1.  We are now
> getting reports of certain patterns of search terms resulting in an error
> that reads, “The website encountered an unexpected error. Please try again
> later.”
>
>
>
> Below is a list of example terms that always result in this error and a
> similar list that works fine.  The problem pattern seems to be a search
> term that contains 2 or 3 characters followed by a space, followed by
> additional text.
>
>
>
> To confirm that the problem is specific to Solr 8, I updated our local
> and UAT sites with the latest Drupal updates, which included an update to
> the Search API Solr module, and tested the terms below under Solr 7.7.2,
> 8.3.1, and 8.4.1.  Under version 7.7.2 everything works fine.  Under either
> of the version 8 releases, the problem returns.
>
>
>
> Thoughts?
>
>
>
> Search terms that result in error
>
> • w-2 agency directory
>
> • agency w-2 directory
>
> • w-2 agency
>
> • w-2 directory
>
> • w2 agency directory
>
> • w2 agency
>
> • w2 directory
>
>
>
> Search terms that do not result in error
>
> • w-22 agency directory
>
> • agency directory w-2
>
> • agency w-2directory
>
> • agencyw-2 directory
>
> • w-2
>
> • w2
>
> • agency directory
>
> • agency
>
> • directory
>
> • -2 agency directory
>
> • 2 agency directory
>
> • w-2agency directory
>
> • w2agency directory
>
>
>
>
>
> *From:* Michele Palmia <micpal...@gmail.com>
> *Sent:* Friday, March 6, 2020 9:50 AM
> *To:* dev@lucene.apache.org
> *Subject:* Re: Inconsistent query results in Lucene 8.1.0
>
>
>
> Hi all,
>
>
>
> I looked into this today. I can reproduce it and I believe it's a bug.
>
> This is caused by the following working together:
> - LUCENE-7386 <https://issues.apache.org/jira/browse/LUCENE-7386>
> Flatten nested disjunctions
>
> - LUCENE-7925 <https://issues.apache.org/jira/browse/LUCENE-7925>
> Deduplicate SHOULD and MUST clauses in BooleanQuery
>
>
>
> Blended term queries modify the df/ttf of their terms to make sure all
> terms produce identical scores. In this case, two blended term queries
> contain a few terms each, only some of which overlap. The two queries
> calculate different df/ttf for their terms respectively, since the two sets
> are different. During the rewrite process,
>
>    1. the two Blended queries get rewritten as Boolean queries
>    themselves, with each (modified) TermQuery as a SHOULD clause
>    2. the nested Boolean queries get flattened, since they are nested
>    disjunctions
>    3. the Term queries (some of which are actually Boost queries) are
>    deduplicated, with one of the two TermQuery and its modified TermStates
>    being picked at random (the randomness is due to the HashSet underlying
>    Lucene's MultiSet).
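The three steps above can be illustrated with a minimal, Lucene-free sketch. Everything in it is an assumption made for illustration: `TermClause` and `flattenAndDedup` are hypothetical names, not Lucene classes or methods, and only the ALTR term is modeled, using the boosts and docFreq values reported later in this thread. It treats clauses as duplicates when their terms match, sums their boosts, and keeps the statistics of whichever instance is seen first, so the surviving docFreq depends purely on iteration order.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class DedupSketch {

    /** Hypothetical stand-in for a rewritten term query carrying blended statistics. */
    static final class TermClause {
        final String term;
        final float boost;
        final int docFreq; // blended statistic; ignored when comparing clauses

        TermClause(String term, float boost, int docFreq) {
            this.term = term;
            this.boost = boost;
            this.docFreq = docFreq;
        }
    }

    /** Flattens nested disjunctions and deduplicates clauses by term, summing boosts. */
    static Map<String, TermClause> flattenAndDedup(List<List<TermClause>> disjunctions) {
        Map<String, TermClause> merged = new LinkedHashMap<>();
        for (List<TermClause> disjunction : disjunctions) {
            for (TermClause clause : disjunction) {
                // Duplicates are detected by term only; the statistics of the
                // instance that was seen first are the ones that survive.
                merged.merge(clause.term, clause,
                        (kept, dup) -> new TermClause(kept.term, kept.boost + dup.boost, kept.docFreq));
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        String altr = "bt_rni_name_encoded_1:ALTR";
        // Same term, different blended docFreq in each original disjunction.
        List<TermClause> first = Arrays.asList(new TermClause(altr, 1.0f, 3));
        List<TermClause> second = Arrays.asList(new TermClause(altr, 0.75f, 2));

        TermClause ab = flattenAndDedup(Arrays.asList(first, second)).get(altr);
        TermClause ba = flattenAndDedup(Arrays.asList(second, first)).get(altr);

        System.out.println("first,second: boost=" + ab.boost + " docFreq=" + ab.docFreq);
        System.out.println("second,first: boost=" + ba.boost + " docFreq=" + ba.docFreq);
    }
}
```

In this sketch the summed boost is 1.75 in both orders, while the docFreq is 3 in one order and 2 in the other. If the real deduplication iterates a hash-based collection, as described above, that order is not something the query author controls.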
>
> I haven't managed to create a failing test yet; I'll share it when I have
> one ready.
>
> If anybody has suggestions or pointers on how this should be fixed, I'm
> also happy to provide a patch. I'm just not sure what the right thing to
> do would be here: I have a feeling that step 2 should not happen for
> (rewritten) Blended queries?
>
>
>
> Cheers,
>
> Michele
>
>
>
>
>
> On Tue, Mar 3, 2020 at 7:55 PM Fiona Hasanaj <fi...@basistech.com> wrote:
>
> Hello,
>
>
>
> I’m Fiona with Basis Technology. We’re investigating what we believe to be
> a bug involving inconsistent query results. We bisected the issue and found
> that it first appears with the merge of LUCENE-7386
> <https://issues.apache.org/jira/browse/LUCENE-7386>, which introduced
> flattening of nested disjunctions.
> To reproduce the issue, I have attached a Lucene index built with
> Lucene 8.1.0 as names_index.tar.gz. If you run the attached Java class
> (LuceneSearchIndex.java) multiple times against Lucene 8.0.0, you'll see
> that max_score is the same between runs, whereas against Lucene 8.1.0
> max_score is inconsistent between runs (try up to 10 runs and you should
> see that it sometimes returns a max_score of 1.8651859 and sometimes
> 2.1415303).
>
>
>
> From debugging in Lucene 8.1.0, the query against the name index, before
> its nested disjunctions are flattened, looks like this:
>
>
> (((bt_rni_name_encoded_1:ALFR)^0.75 bt_rni_name_encoded_1:ALTR 
> (bt_rni_name_encoded_1:ANTR)^0.75 (bt_rni_name_encoded_1:LTR)^0.6666666) 
> ((bt_rni_name_encoded_1:ALTR)^0.75 (bt_rni_name_encoded_1:FLTMR)^0.75 
> (bt_rni_name_encoded_1:FLTRN)^0.75 (bt_rni_name_encoded_1:FLTS)^0.75 
> (bt_rni_name_encoded_1:FTR)^0.6666666 (bt_rni_name_encoded_1:LTR)^0.6666666)) 
> | (((bt_rni_name_encoded_2:FLTR)^0.75) (bt_rni_name_encoded_2:FLTR 
> (bt_rni_name_encoded_2:FLTRN)^0.75))
>
>
> The term that's causing the difference in the final score is
> bt_rni_name_encoded_1:ALTR. As the query above shows, it appears twice,
> nested under different clauses: in the first clause its docFreq is 3,
> and in the second clause its docFreq is 2. This happens in Lucene 8.0.0
> as well. *Is a term being read with different docFreq values expected
> behaviour?*
>
>
>
> After the nested disjunctions are flattened (part of the query rewrite
> process), the query looks like this:
>
>
> ((bt_rni_name_encoded_1:FTR)^0.6666666 (bt_rni_name_encoded_1:FLTRN)^0.75 
> (bt_rni_name_encoded_1:FLTMR)^0.75 (bt_rni_name_encoded_1:ALFR)^0.75 
> (bt_rni_name_encoded_1:FLTS)^0.75 (bt_rni_name_encoded_1:ANTR)^0.75 
> (bt_rni_name_encoded_1:LTR)^1.3333333 (bt_rni_name_encoded_1:ALTR)^1.75) | 
> ((bt_rni_name_encoded_2:FLTRN)^0.75 (bt_rni_name_encoded_2:FLTR)^1.75)
>
>
>
> As you can see, bt_rni_name_encoded_1:ALTR appears only once, and its
> boost is the sum of its boosts in the original query. This is the version
> of the query that actually gets used. Here the docFreq for the
> bt_rni_name_encoded_1:ALTR term is sometimes 3 and sometimes 2 between
> runs, and the final score changes accordingly. *Is this "coin toss" pick
> of docFreq for the same term expected behaviour?*
>
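To see why a docFreq of 2 versus 3 moves the score, here is a minimal sketch of a BM25-style inverse document frequency, ln(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)). It assumes a BM25-like similarity is in use, which the thread does not state, and the `docCount` of 100 is a hypothetical value chosen only for illustration; the class and method names are likewise made up and the sketch does not attempt to reproduce the max_score values reported above.

```java
final class IdfSketch {

    /** BM25-style idf: ln(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)). */
    static double idf(long docFreq, long docCount) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    public static void main(String[] args) {
        long docCount = 100; // hypothetical index size, not taken from the thread

        // The same term, scored with each of the two docFreq values it may end up with.
        System.out.printf("docFreq=3 -> idf=%.4f%n", idf(3, docCount));
        System.out.printf("docFreq=2 -> idf=%.4f%n", idf(2, docCount));
    }
}
```

Under these assumptions the smaller docFreq yields the larger idf, so whichever value survives deduplication changes the term's contribution to the final score.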
>
>
> It looks like the issue stems from one of the two behaviours highlighted
> in bold above.
>
>
>
> Looking forward to hearing back from you.
>
>
>
>
>
>
