[
https://issues.apache.org/jira/browse/SOLR-16594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17650895#comment-17650895
]
Rudi Seitz commented on SOLR-16594:
-----------------------------------
Steps to reproduce inconsistent {{mm}} behavior caused by the shift from
term-centric to field-centric query structure. Tested in Solr 9.1.
Create collection using the default schema and index the following documents:
{{{"id":"1", "field1_ws":"XY GB"}}}
{{{"id":"2", "field1_ws":"XY", "field2_ws":"GB", "field2_txt":"GB"}}}
{{{"id":"3", "field1_ws":"XY GC"}}}
{{{"id":"4", "field1_ws":"XY", "field2_ws":"GC", "field2_txt":"GC"}}}
Note that the default schema contains a synonym rule for GB which will be
applied in _txt fields:
{{GB,gib,gigabyte,gigabytes}}
Now try the following edismax query for "XY GB" with "minimum should match"
set to 100%:
{{q=XY GB}}
{{mm=100%}}
{{qf=field1_ws field2_ws}}
{{defType=edismax}}
{{http://localhost:8983/solr/test/select?defType=edismax&indent=true&mm=100%25&q.op=OR&q=XY%20GB&qf=field1_ws%20field2_ws}}
Notice that BOTH document 1 and document 2 are returned. This is because
edismax is generating a term-centric query which allows the terms "XY" and "GB"
to match in any of the qf fields.
Now add the txt version of field2 to the qf:
{{qf=field1_ws field2_ws field2_txt}}
{{http://localhost:8983/solr/test/select?defType=edismax&indent=true&mm=100%25&q.op=OR&q=XY%20GB&qf=field1_ws%20field2_ws%20field2_txt}}
Rerun the query and notice that ONLY document 1 is returned. This is because
field2_txt expands synonyms, which leads to a different number of tokens from
the _ws fields, which causes edismax to generate a field-centric query, which
requires that the terms "XY" and "GB" must both match in _one_ of the provided
qf fields. It is counterintuitive that expanding the search to include more
fields actually _reduces_ recall here; note that the effect does not appear in
every case:
Repeat this experiment with {{q=XY GC}}
In this case, notice that BOTH document 3 and document 4 are returned for both
versions of qf –
there is no change when we add field2_txt to qf. That is because there is no
synonym rule for GC, so even though _ws and _txt fields have "incompatible"
analysis chains they happen to generate the same number of tokens for this
particular query and edismax is able to stay with the term-centric approach.
In these experiments we have been assuming the default {{sow=false}}. If we
set {{sow=true}} we would see that the term-centric approach is used throughout
and there is no change in behavior when we add field2_txt to qf, whether we are
searching for "XY GB" or "XY GC".
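To make the {{mm}} difference concrete, here is a toy Python sketch (not Solr code; the documents and the matching logic are simplified stand-ins) that evaluates both query shapes against documents 1 and 2 from the steps above:

```python
# Hypothetical miniature of the two query shapes edismax can produce for
# q="XY GB", qf=field1_ws field2_ws, mm=100%. Field values are pre-tokenized
# lists; real matching happens in Lucene, not like this.
docs = {
    "1": {"field1_ws": ["XY", "GB"]},
    "2": {"field1_ws": ["XY"], "field2_ws": ["GB"]},
}
terms = ["XY", "GB"]
fields = ["field1_ws", "field2_ws"]

def term_centric_match(doc):
    # ((f1:XY | f2:XY) AND (f1:GB | f2:GB)): mm=100% applies across terms,
    # and each term may match in any field
    return all(any(t in doc.get(f, []) for f in fields) for t in terms)

def field_centric_match(doc):
    # ((f1:XY f1:GB)~2 | (f2:XY f2:GB)~2): mm=100% applies per field,
    # so some single field must contain every term
    return any(all(t in doc.get(f, []) for t in terms) for f in fields)

print([d for d in docs if term_centric_match(docs[d])])   # docs 1 and 2
print([d for d in docs if field_centric_match(docs[d])])  # doc 1 only
```

Document 2 is the one that disappears under the field-centric shape, matching the behavior observed above.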
> eDismax should use startOffset when converting per-field to per-term queries
> ----------------------------------------------------------------------------
>
> Key: SOLR-16594
> URL: https://issues.apache.org/jira/browse/SOLR-16594
> Project: Solr
> Issue Type: Improvement
> Security Level: Public (Default Security Level. Issues are Public)
> Components: query parsers
> Reporter: Rudi Seitz
> Priority: Major
>
> When parsing a multi-term query that spans multiple fields, edismax sometimes
> switches from a "term-centric" to a "field-centric" approach. This creates
> inconsistent semantics for the {{mm}} or "min should match" parameter and may
> have an impact on scoring. The goal of this ticket is to improve the approach
> that edismax uses for generating term-centric queries so that edismax would
> less frequently "give up" and resort to the field-centric approach.
> Specifically, we propose that edismax should create a dismax query for each
> distinct startOffset found among the tokens emitted by the field analyzers.
> Since the relevant code in edismax works with Query objects that contain
> Terms, and since Terms do not hold the startOffset of the Token from which
> the Term was derived, some plumbing work would need to be done to make the
> startOffsets available to edismax.
>
> BACKGROUND:
>
> If a user searches for "foo bar" with {{qf=f1 f2}}, a field-centric
> interpretation of the query would contain a clause for each field:
> {{ (f1:foo f1:bar) (f2:foo f2:bar)}}
> while a term-centric interpretation would contain a clause for each term:
> {{ (f1:foo f2:foo) (f1:bar f2:bar)}}
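A minimal Python sketch (illustrative only, not Solr code) of how the two shapes above are assembled from the same terms and fields:

```python
# Build the field-centric and term-centric interpretations of q="foo bar"
# with qf=f1 f2, as query strings rather than Lucene Query objects.
terms = ["foo", "bar"]
fields = ["f1", "f2"]

# field-centric: one boolean clause per field, each covering all terms
field_centric = " ".join(
    "(" + " ".join(f"{f}:{t}" for t in terms) + ")" for f in fields
)
# term-centric: one clause per term, each covering all fields
term_centric = " ".join(
    "(" + " ".join(f"{f}:{t}" for f in fields) + ")" for t in terms
)
print(field_centric)  # (f1:foo f1:bar) (f2:foo f2:bar)
print(term_centric)   # (f1:foo f2:foo) (f1:bar f2:bar)
```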
> The challenge in generating a term-centric query is that we need to take the
> tokens that emerge from each field's analysis chain and group them according
> to the terms in the user's original query. However, the tokens that emerge
> from an analysis chain do not store a reference to their corresponding input
> terms. For example, if we pass "foo bar" through an ngram analyzer we would
> get a token stream containing "f", "fo", "foo", "b", "ba", "bar". While it
> may be obvious to a human that "f", "fo", and "foo" all come from the "foo"
> input term, and that "b", "ba", and "bar" come from the "bar" input term,
> there is not always an easy way for edismax to see this connection. When
> {{sow=true}}, edismax passes each whitespace-separated term through each
> analysis chain separately, and therefore edismax "knows" that the output
> tokens from any given analysis chain are all derived from the single input
> term that was passed into that chain. However, when {{sow=false}},
> edismax passes the entire multi-term query through each analysis chain as a
> whole, resulting in multiple output tokens that are not "connected" to their
> source term.
> Edismax still tries to generate a term-centric query when {{sow=false}} by
> first generating a boolean query for each field, and then checking whether
> all of these per-field queries have the same structure. The structure will
> generally be uniform if each analysis chain emits the same number of tokens
> for the given input. If one chain has a synonym filter and another doesn’t,
> this uniformity may depend on whether a synonym rule happened to match a term
> in the user's input.
> Assuming the per-field boolean queries _do_ have the same structure, edismax
> reorganizes them into a new boolean query. The new query contains a dismax
> for each clause position in the original queries. If the original queries are
> {{(f1:foo f1:bar)}} and {{(f2:foo f2:bar)}}, we can see they have two
> clauses each, so we would get a dismax containing all the first-position
> clauses {{(f1:foo f2:foo)}} and another dismax containing all the
> second-position clauses {{(f1:bar f2:bar)}}.
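The clause-position heuristic, and the uniformity check it depends on, can be sketched in Python as follows (simplified; Solr works with Lucene Query objects, not strings, and the function name here is hypothetical):

```python
def regroup_by_position(per_field_queries):
    """Turn per-field clause lists into per-position (term-centric) groups,
    or return None when the structures are not uniform."""
    lengths = {len(clauses) for clauses in per_field_queries.values()}
    if len(lengths) != 1:
        # structures differ (e.g. one field expanded a synonym):
        # edismax gives up and stays field-centric
        return None
    # one dismax-like group per clause position
    return [list(pos) for pos in zip(*per_field_queries.values())]

# uniform structure: regrouping succeeds
same = {"f1": ["f1:foo", "f1:bar"], "f2": ["f2:foo", "f2:bar"]}
print(regroup_by_position(same))
# [['f1:foo', 'f2:foo'], ['f1:bar', 'f2:bar']]

# synonym expansion in f2 breaks uniformity -> field-centric fallback
diff = {"f1": ["f1:GB"], "f2": ["f2:GB", "f2:gib", "f2:gigabyte"]}
print(regroup_by_position(diff))  # None
```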
> We can see that edismax is using clause position as a heuristic to reorganize
> the per-field boolean queries into per-term ones, even though it doesn't know
> for sure which clauses inside those per-field boolean queries are related to
> which input terms. We propose that a better way of reorganizing the per-field
> boolean queries is to create a dismax for each distinct startOffset seen
> among the tokens in the token streams emitted by each field analyzer. The
> startOffset of a token (more precisely, of a PackedTokenAttributeImpl) is "the position
> of the first character corresponding to this token in the source text".
> We propose that startOffset is a reasonable way of matching output tokens up
> with the input terms that gave rise to them. For example, if we pass "foo
> bar" through an ngram analysis chain we see that the foo-related tokens all
> have startOffset=0 while the bar-related tokens all have startOffset=4.
> Likewise, tokens that are generated via synonym expansion have a startOffset
> that points to the beginning of the matching input term. For example, if the
> query "GB" generates "GB gib gigabyte gigabytes" via synonym expansion, all
> of those four tokens would have startOffset=0.
> Here's an example of how the proposed edismax logic would work. Let's say a
> user searches for "foo bar" across two fields, f1 and f2, where f1 uses a
> standard text analysis chain while f2 generates ngrams. We would get
> field-centric queries {{(f1:foo f1:bar)}} and {{(f2:f f2:fo f2:foo f2:b
> f2:ba f2:bar)}}. Edismax's "all same query structure" check would fail
> here, but if we look for the unique startOffsets seen among all the tokens we
> would find offsets 0 and 4. We could then generate one clause for all the
> startOffset=0 tokens {{(f1:foo f2:f f2:fo f2:foo)}} and another for all the
> startOffset=4 tokens: {{(f1:bar f2:b f2:ba f2:bar)}}. This would
> effectively give us a "term-centric" query with consistent mm and scoring
> semantics, even though the analysis chains are not "compatible."
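The proposed startOffset-based grouping for this example can be sketched in Python (illustrative only; real tokens would come from Lucene TokenStreams, and the hard-coded offsets below assume the ngram analysis described above):

```python
from collections import defaultdict

# Tokens as (field, term, startOffset) triples for q="foo bar" with
# qf=f1 (standard analysis) and f2 (ngrams).
tokens = [
    ("f1", "foo", 0), ("f1", "bar", 4),
    ("f2", "f", 0), ("f2", "fo", 0), ("f2", "foo", 0),
    ("f2", "b", 4), ("f2", "ba", 4), ("f2", "bar", 4),
]

# one group per distinct startOffset, regardless of per-field token counts
groups = defaultdict(list)
for field, term, start in tokens:
    groups[start].append(f"{field}:{term}")

for start in sorted(groups):
    print(start, groups[start])
# 0 ['f1:foo', 'f2:f', 'f2:fo', 'f2:foo']
# 4 ['f1:bar', 'f2:b', 'f2:ba', 'f2:bar']
```

Unlike the clause-position heuristic, this grouping needs no structural-uniformity check, because the offset itself ties each output token back to its source term.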
> As mentioned, there would be significant plumbing needed to make startOffsets
> available to edismax in the code where the per-field queries are converted
> into per-term queries. Modifications would possibly be needed in both the
> Solr and Lucene repos. This ticket is logged in hopes of gathering feedback
> about whether this is a worthwhile/viable approach to pursue further.
>
> Related tickets:
> https://issues.apache.org/jira/browse/SOLR-12779
> https://issues.apache.org/jira/browse/SOLR-15407
>
> Related blog entries:
> [https://opensourceconnections.com/blog/2018/02/20/edismax-and-multiterm-synonyms-oddities]
> [https://sease.io/2021/05/apache-solr-sow-parameter-split-on-whitespace-and-multi-field-full-text-search.html]
>