[
https://issues.apache.org/jira/browse/SOLR-16594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17650895#comment-17650895
]
Rudi Seitz commented on SOLR-16594:
-----------------------------------
Steps to reproduce inconsistent {{mm}} behavior caused by the shift from
term-centric to field-centric query structure. Tested in Solr 9.1.
Create collection using the default schema and index the following documents:
{{{"id":"1", "field1_ws":"XY GB"}}}
{{{"id":"2", "field1_ws":"XY", "field2_ws":"GB", "field2_txt":"GB"}}}
{{{"id":"3", "field1_ws":"XY GC"}}}
{{{"id":"4", "field1_ws":"XY", "field2_ws":"GC", "field2_txt":"GC"}}}
Note that the default schema contains a synonym rule for GB which will be
applied in _txt fields:
{{GB,gib,gigabyte,gigabytes}}
Now try the following edismax query for "XY GB" with "minimum should match"
set to 100%:
{{q=XY GB}}
{{mm=100%}}
{{qf=field1_ws field2_ws}}
{{defType=edismax}}
{{http://localhost:8983/solr/test/select?defType=edismax&indent=true&mm=100%25&q.op=OR&q=XY%20GB&qf=field1_ws%20field2_ws}}
Notice that BOTH document 1 and document 2 are returned. This is because
edismax is generating a term-centric query which allows the terms "XY" and "GB"
to match in any of the qf fields.
Now add the txt version of field2 to the qf:
{{qf=field1_ws field2_ws field2_txt}}
{{http://localhost:8983/solr/test/select?defType=edismax&indent=true&mm=100%25&q.op=OR&q=XY%20GB&qf=field1_ws%20field2_ws%20field2_txt}}
Rerun the query and notice that ONLY document 1 is returned. This is because
field2_txt expands synonyms, which leads to a different number of tokens from
the _ws fields, which causes edismax to generate a field-centric query, which
requires that the terms "XY" and "GB" must both match in _one_ of the provided
qf fields. It is counterintuitive that expanding the search to include more
fields actually _reduces_ recall here; note that the effect does not appear in
every case:
Repeat this experiment with {{q=XY GC}}
In this case, notice that BOTH document 3 and document 4 are returned for both
versions of qf –
there is no change when we add field2_txt to qf. That is because there is no
synonym rule for GC, so even though _ws and _txt fields have "incompatible"
analysis chains they happen to generate the same number of tokens for this
particular query and edismax is able to stay with the term-centric approach.
In these experiments we have been assuming the default {{sow=false}}. If we
set {{sow=true}} we would see that the term-centric approach is used throughout
and there is no change in behavior when we add field2_txt to qf, whether we are
searching for "XY GB" or "XY GC".
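To make the {{mm}} difference concrete, here is a toy Python sketch (not Solr code; the documents and the matching logic are simplified stand-ins) that evaluates both query shapes against documents 1 and 2 from the steps above:

```python
# Hypothetical miniature of the two query shapes edismax can produce for
# q="XY GB", qf=field1_ws field2_ws, mm=100%. Field values are pre-tokenized
# lists; real matching happens in Lucene, not like this.
docs = {
    "1": {"field1_ws": ["XY", "GB"]},
    "2": {"field1_ws": ["XY"], "field2_ws": ["GB"]},
}
terms = ["XY", "GB"]
fields = ["field1_ws", "field2_ws"]

def term_centric_match(doc):
    # ((f1:XY | f2:XY) AND (f1:GB | f2:GB)): mm=100% applies across terms,
    # and each term may match in any field
    return all(any(t in doc.get(f, []) for f in fields) for t in terms)

def field_centric_match(doc):
    # ((f1:XY f1:GB)~2 | (f2:XY f2:GB)~2): mm=100% applies per field,
    # so some single field must contain every term
    return any(all(t in doc.get(f, []) for t in terms) for f in fields)

print([d for d in docs if term_centric_match(docs[d])])   # docs 1 and 2
print([d for d in docs if field_centric_match(docs[d])])  # doc 1 only
```

Document 2 is the one that disappears under the field-centric shape, matching the behavior observed above.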
> eDismax should use startOffset when converting per-field to per-term queries
> ----------------------------------------------------------------------------
>
> Key: SOLR-16594
> URL: https://issues.apache.org/jira/browse/SOLR-16594
> Project: Solr
> Issue Type: Improvement
> Security Level: Public (Default Security Level. Issues are Public)
> Components: query parsers
> Reporter: Rudi Seitz
> Priority: Major
>
> When parsing a multi-term query that spans multiple fields, edismax sometimes
> switches from a "term-centric" to a "field-centric" approach. This creates
> inconsistent semantics for the {{mm}} or "min should match" parameter and may
> have an impact on scoring. The goal of this ticket is to improve the approach
> that edismax uses for generating term-centric queries so that edismax would
> less frequently "give up" and resort to the field-centric approach.
> Specifically, we propose that edismax should create a dismax query for each
> distinct startOffset found among the tokens emitted by the field analyzers.
> Since the relevant code in edismax works with Query objects that contain
> Terms, and since Terms do not hold the startOffset of the Token from which
> the Term was derived, some plumbing work would need to be done to make the
> startOffsets available to edismax.
>
> BACKGROUND:
>
> If a user searches for "foo bar" with {{qf=f1 f2}}, a field-centric
> interpretation of the query would contain a clause for each field:
> {{ (f1:foo f1:bar) (f2:foo f2:bar)}}
> while a term-centric interpretation would contain a clause for each term:
> {{ (f1:foo f2:foo) (f1:bar f2:bar)}}
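A minimal Python sketch (illustrative only, not Solr code) of how the two shapes above are assembled from the same terms and fields:

```python
# Build the field-centric and term-centric interpretations of q="foo bar"
# with qf=f1 f2, as query strings rather than Lucene Query objects.
terms = ["foo", "bar"]
fields = ["f1", "f2"]

# field-centric: one boolean clause per field, each covering all terms
field_centric = " ".join(
    "(" + " ".join(f"{f}:{t}" for t in terms) + ")" for f in fields
)
# term-centric: one clause per term, each covering all fields
term_centric = " ".join(
    "(" + " ".join(f"{f}:{t}" for f in fields) + ")" for t in terms
)
print(field_centric)  # (f1:foo f1:bar) (f2:foo f2:bar)
print(term_centric)   # (f1:foo f2:foo) (f1:bar f2:bar)
```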
> The challenge in generating a term-centric query is that we need to take the
> tokens that emerge from each field's analysis chain and group them according
> to the terms in the user's original query. However, the tokens that emerge
> from an analysis chain do not store a reference to their corresponding input
> terms. For example, if we pass "foo bar" through an ngram analyzer we would
> get a token stream containing "f", "fo", "foo", "b", "ba", "bar". While it
> may be obvious to a human that "f", "fo", and "foo" all come from the "foo"
> input term, and that "b", "ba", and "bar" come from the "bar" input term,
> there is not always an easy way for edismax to see this connection. When
> {{sow=true}}, edismax passes each whitespace-separated term through each
> analysis chain separately, and therefore edismax "knows" that the output
> tokens from any given analysis chain are all derived from the single input
> term that was passed into that chain. However, when {{sow=false}},
> edismax passes the entire multi-term query through each analysis chain as a
> whole, resulting in multiple output tokens that are not "connected" to their
> source term.
> Edismax still tries to generate a term-centric query when {{sow=false}} by
> first generating a boolean query for each field, and then checking whether
> all of these per-field queries have the same structure. The structure will
> generally be uniform if each analysis chain emits the same number of tokens
> for the given input. If one chain has a synonym filter and another doesn’t,
> this uniformity may depend on whether a synonym rule happened to match a term
> in the user's input.
> Assuming the per-field boolean queries _do_ have the same structure, edismax
> reorganizes them into a new boolean query. The new query contains a dismax
> for each clause position in the original queries. If the original queries are
> {{(f1:foo f1:bar)}} and {{(f2:foo f2:bar)}}, we can see they have two
> clauses each, so we would get a dismax containing all the first-position
> clauses {{(f1:foo f2:foo)}} and another dismax containing all the
> second-position clauses {{(f1:bar f2:bar)}}.
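The clause-position heuristic, and the uniformity check it depends on, can be sketched in Python as follows (simplified; Solr works with Lucene Query objects, not strings, and the function name here is hypothetical):

```python
def regroup_by_position(per_field_queries):
    """Turn per-field clause lists into per-position (term-centric) groups,
    or return None when the structures are not uniform."""
    lengths = {len(clauses) for clauses in per_field_queries.values()}
    if len(lengths) != 1:
        # structures differ (e.g. one field expanded a synonym):
        # edismax gives up and stays field-centric
        return None
    # one dismax-like group per clause position
    return [list(pos) for pos in zip(*per_field_queries.values())]

# uniform structure: regrouping succeeds
same = {"f1": ["f1:foo", "f1:bar"], "f2": ["f2:foo", "f2:bar"]}
print(regroup_by_position(same))
# [['f1:foo', 'f2:foo'], ['f1:bar', 'f2:bar']]

# synonym expansion in f2 breaks uniformity -> field-centric fallback
diff = {"f1": ["f1:GB"], "f2": ["f2:GB", "f2:gib", "f2:gigabyte"]}
print(regroup_by_position(diff))  # None
```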
> We can see that edismax is using clause position as a heuristic to reorganize
> the per-field boolean queries into per-term ones, even though it doesn't know
> for sure which clauses inside those per-field boolean queries are related to
> which input terms. We propose that a better way of reorganizing the per-field
> boolean queries is to create a dismax for each distinct startOffset seen
> among the tokens in the token streams emitted by each field analyzer. The
> startOffset of a token (more precisely, of a PackedTokenAttributeImpl) is "the position
> of the first character corresponding to this token in the source text".
> We propose that startOffset is a reasonable way of matching output tokens up
> with the input terms that gave rise to them. For example, if we pass "foo
> bar" through an ngram analysis chain we see that the foo-related tokens all
> have startOffset=0 while the bar-related tokens all have startOffset=4.
> Likewise, tokens that are generated via synonym expansion have a startOffset
> that points to the beginning of the matching input term. For example, if the
> query "GB" generates "GB gib gigabyte gigabytes" via synonym expansion, all
> of those four tokens would have startOffset=0.
> Here's an example of how the proposed edismax logic would work. Let's say a
> user searches for "foo bar" across two fields, f1 and f2, where f1 uses a
> standard text analysis chain while f2 generates ngrams. We would get
> field-centric queries {{(f1:foo f1:bar)}} and {{(f2:f f2:fo f2:foo f2:b
> f2:ba f2:bar)}}. Edismax's "all same query structure" check would fail
> here, but if we look for the unique startOffsets seen among all the tokens we
> would find offsets 0 and 4. We could then generate one clause for all the
> startOffset=0 tokens {{(f1:foo f2:f f2:fo f2:foo)}} and another for all the
> startOffset=4 tokens: {{(f1:bar f2:b f2:ba f2:bar)}}. This would
> effectively give us a "term-centric" query with consistent mm and scoring
> semantics, even though the analysis chains are not "compatible."
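The proposed startOffset-based grouping for this example can be sketched in Python (illustrative only; real tokens would come from Lucene TokenStreams, and the hard-coded offsets below assume the ngram analysis described above):

```python
from collections import defaultdict

# Tokens as (field, term, startOffset) triples for q="foo bar" with
# qf=f1 (standard analysis) and f2 (ngrams).
tokens = [
    ("f1", "foo", 0), ("f1", "bar", 4),
    ("f2", "f", 0), ("f2", "fo", 0), ("f2", "foo", 0),
    ("f2", "b", 4), ("f2", "ba", 4), ("f2", "bar", 4),
]

# one group per distinct startOffset, regardless of per-field token counts
groups = defaultdict(list)
for field, term, start in tokens:
    groups[start].append(f"{field}:{term}")

for start in sorted(groups):
    print(start, groups[start])
# 0 ['f1:foo', 'f2:f', 'f2:fo', 'f2:foo']
# 4 ['f1:bar', 'f2:b', 'f2:ba', 'f2:bar']
```

Unlike the clause-position heuristic, this grouping needs no structural-uniformity check, because the offset itself ties each output token back to its source term.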
> As mentioned, there would be significant plumbing needed to make startOffsets
> available to edismax in the code where the per-field queries are converted
> into per-term queries. Modifications would possibly be needed in both the
> Solr and Lucene repos. This ticket is logged in hopes of gathering feedback
> about whether this is a worthwhile/viable approach to pursue further.
>
> Related tickets:
> https://issues.apache.org/jira/browse/SOLR-12779
> https://issues.apache.org/jira/browse/SOLR-15407
>
> Related blog entries:
> [https://opensourceconnections.com/blog/2018/02/20/edismax-and-multiterm-synonyms-oddities]
> [https://sease.io/2021/05/apache-solr-sow-parameter-split-on-whitespace-and-multi-field-full-text-search.html]
>