[ https://issues.apache.org/jira/browse/SOLR-16594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17700766#comment-17700766 ]
Rudi Seitz commented on SOLR-16594:
-----------------------------------
PR: [https://github.com/apache/solr/pull/1463]
> improve eDismax strategy for generating a term-centric query
> ------------------------------------------------------------
>
> Key: SOLR-16594
> URL: https://issues.apache.org/jira/browse/SOLR-16594
> Project: Solr
> Issue Type: Improvement
> Components: query parsers
> Reporter: Rudi Seitz
> Priority: Major
> Time Spent: 10m
> Remaining Estimate: 0h
>
> When parsing a multi-term query that spans multiple fields, edismax attempts
> to generate a term-centric query structure but sometimes switches from a
> "term-centric" to a "field-centric" approach. This
> creates inconsistent semantics for the {{mm}} or "min should match" parameter
> and may have an impact on scoring. The goal of this ticket is to improve the
> approach that edismax uses for generating term-centric queries so that
> edismax would less frequently "give up" and resort to the field-centric
> approach. Specifically, we propose that edismax should create a dismax query
> for each distinct startOffset found among the tokens emitted by the field
> analyzers. Since the relevant code in edismax works with Query objects that
> contain Terms, and since Terms do not hold the startOffset of the Token from
> which each Term was derived, some plumbing work would need to be done to make the
> startOffsets available to edismax.
>
> BACKGROUND:
>
> If a user searches for "foo bar" with {{qf=f1 f2}}, a field-centric
> interpretation of the query would contain a clause for each field:
> {{ (f1:foo f1:bar) (f2:foo f2:bar)}}
> while a term-centric interpretation would contain a clause for each term:
> {{ (f1:foo f2:foo) (f1:bar f2:bar)}}
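
The two shapes above can be sketched in a few lines of Python. This is only an illustration of the clause layout, not Solr code; the function names `field_centric`, `term_centric`, and `render` are hypothetical.

```python
def field_centric(terms, fields):
    """One clause per field; each clause contains every term."""
    return [[f"{field}:{term}" for term in terms] for field in fields]


def term_centric(terms, fields):
    """One clause per term; each clause contains every field."""
    return [[f"{field}:{term}" for field in fields] for term in terms]


def render(clauses):
    """Render nested clauses in the parenthesized notation used in this ticket."""
    return " ".join("(" + " ".join(clause) + ")" for clause in clauses)


terms, fields = ["foo", "bar"], ["f1", "f2"]
print(render(field_centric(terms, fields)))  # (f1:foo f1:bar) (f2:foo f2:bar)
print(render(term_centric(terms, fields)))   # (f1:foo f2:foo) (f1:bar f2:bar)
```

With {{mm}} applied to the top-level clauses, the first form counts matching fields while the second counts matching terms, which is why the two shapes behave differently.
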
> The challenge in generating a term-centric query is that we need to take the
> tokens that emerge from each field's analysis chain and group them according
> to the terms in the user's original query. However, the tokens that emerge
> from an analysis chain do not store a reference to their corresponding input
> terms. For example, if we pass "foo bar" through an ngram analyzer we would
> get a token stream containing "f", "fo", "foo", "b", "ba", "bar". While it
> may be obvious to a human that "f", "fo", and "foo" all come from the "foo"
> input term, and that "b", "ba", and "bar" come from the "bar" input term,
> there is not always an easy way for edismax to see this connection. When
> {{sow=true}}, edismax passes each whitespace-separated term through each
> analysis chain separately, and therefore edismax "knows" that the output
> tokens from any given analysis chain are all derived from the single input
> term that was passed into that chain. However, when {{sow=false}},
> edismax passes the entire multi-term query through each analysis chain as a
> whole, resulting in multiple output tokens that are not "connected" to their
> source term.
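
As a rough illustration of the difference, the following Python sketch models an analysis chain as a plain function from text to a list of token strings. The names `toy_edge_ngrams`, `analyze_sow_true`, and `analyze_sow_false` are hypothetical and the ngram analyzer is a toy stand-in, not a Lucene analyzer.

```python
def toy_edge_ngrams(text):
    """Toy analyzer: emit every prefix of every whitespace-separated term."""
    return [term[:n] for term in text.split() for n in range(1, len(term) + 1)]


def analyze_sow_true(query, analyzer):
    """Analyze each term on its own, so each output list is tied to its source term."""
    return [(term, analyzer(term)) for term in query.split()]


def analyze_sow_false(query, analyzer):
    """Analyze the whole query at once; the link back to the source terms is lost."""
    return analyzer(query)


print(analyze_sow_true("foo bar", toy_edge_ngrams))
# [('foo', ['f', 'fo', 'foo']), ('bar', ['b', 'ba', 'bar'])]
print(analyze_sow_false("foo bar", toy_edge_ngrams))
# ['f', 'fo', 'foo', 'b', 'ba', 'bar']
```
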
> Edismax still tries to generate a term-centric query when {{sow=false}} by
> first generating a boolean query for each field, and then checking whether
> all of these per-field queries have the same structure. The structure will
> generally be uniform if each analysis chain emits the same number of tokens
> for the given input. If one chain has a synonym filter and another doesn’t,
> this uniformity may depend on whether a synonym rule happened to match a term
> in the user's input.
> Assuming the per-field boolean queries _do_ have the same structure, edismax
> reorganizes them into a new boolean query. The new query contains a dismax
> for each clause position in the original queries. If the original queries are
> {{(f1:foo f1:bar)}} and {{(f2:foo f2:bar)}} we can see they have two clauses
> each, so we would get a dismax containing all the first position clauses
> {{(f1:foo f2:foo)}} and another dismax containing all the second position
> clauses {{(f1:bar f2:bar)}}.
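
A simplified Python sketch of this position-based regrouping follows. It assumes that "same structure" can be approximated by "same number of clauses", which is looser than the check edismax actually performs; `regroup_by_position` is a hypothetical name, not a Solr method.

```python
def regroup_by_position(per_field_queries):
    """Regroup per-field clause lists into one group per clause position.

    Returns None when the clause counts differ, modeling the case where
    edismax gives up and keeps the field-centric form.
    """
    clause_counts = {len(clauses) for clauses in per_field_queries}
    if len(clause_counts) != 1:
        return None
    # zip(*...) transposes field-major clause lists into position-major groups.
    return [list(group) for group in zip(*per_field_queries)]


uniform = [["f1:foo", "f1:bar"], ["f2:foo", "f2:bar"]]
print(regroup_by_position(uniform))
# [['f1:foo', 'f2:foo'], ['f1:bar', 'f2:bar']]

mismatched = [["f1:foo", "f1:bar"],
              ["f2:f", "f2:fo", "f2:foo", "f2:b", "f2:ba", "f2:bar"]]
print(regroup_by_position(mismatched))
# None
```
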
> We can see that edismax is using clause position as a heuristic to reorganize
> the per-field boolean queries into per-term ones, even though it doesn't know
> for sure which clauses inside those per-field boolean queries are related to
> which input terms. We propose that a better way of reorganizing the per-field
> boolean queries is to create a dismax for each distinct startOffset seen
> among the tokens in the token streams emitted by each field analyzer. The
> startOffset of a token (more precisely, of a PackedTokenAttributeImpl) is "the position
> of the first character corresponding to this token in the source text".
> We propose that startOffset is a reasonable way of matching output tokens up
> with the input terms that gave rise to them. For example, if we pass "foo
> bar" through an ngram analysis chain we see that the foo-related tokens all
> have startOffset=0 while the bar-related tokens all have startOffset=4.
> Likewise, tokens that are generated via synonym expansion have a startOffset
> that points to the beginning of the matching input term. For example, if the
> query "GB" generates "GB gib gigabyte gigabytes" via synonym expansion, all
> four of those tokens would have startOffset=0.
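
To make the offsets concrete, here is a toy Python model of the two cases. It assumes that derived tokens inherit the start offset of the whitespace-separated term they came from, as described above; the functions `whitespace_tokens`, `edge_ngrams`, and `expand_synonyms` are hypothetical and do not correspond to Lucene classes.

```python
import re


def whitespace_tokens(text):
    """Return (term, start_offset) pairs for each whitespace-separated term."""
    return [(m.group(), m.start()) for m in re.finditer(r"\S+", text)]


def edge_ngrams(text):
    """Emit every prefix of each term, keeping the source term's start offset."""
    return [(term[:n], start)
            for term, start in whitespace_tokens(text)
            for n in range(1, len(term) + 1)]


def expand_synonyms(text, synonyms):
    """Emit each term followed by its synonyms, all at the term's start offset."""
    out = []
    for term, start in whitespace_tokens(text):
        out.append((term, start))
        out.extend((syn, start) for syn in synonyms.get(term, []))
    return out


print(edge_ngrams("foo bar"))
# [('f', 0), ('fo', 0), ('foo', 0), ('b', 4), ('ba', 4), ('bar', 4)]
print(expand_synonyms("GB", {"GB": ["gib", "gigabyte", "gigabytes"]}))
# [('GB', 0), ('gib', 0), ('gigabyte', 0), ('gigabytes', 0)]
```
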
> Here's an example of how the proposed edismax logic would work. Let's say a
> user searches for "foo bar" across two fields, f1 and f2, where f1 uses a
> standard text analysis chain while f2 generates ngrams. We would get
> field-centric queries {{(f1:foo f1:bar)}} and {{(f2:f f2:fo f2:foo f2:b
> f2:ba f2:bar)}}. Edismax's "all same query structure" check would fail
> here, but if we look for the unique startOffsets seen among all the tokens we
> would find offsets 0 and 4. We could then generate one clause for all the
> startOffset=0 tokens {{(f1:foo f2:f f2:fo f2:foo)}} and another for all the
> startOffset=4 tokens: {{(f1:bar f2:b f2:ba f2:bar)}}. This would
> effectively give us a "term-centric" query with consistent mm and scoring
> semantics, even though the analysis chains are not "compatible."
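
The proposed grouping can be sketched in Python as follows. This is a minimal illustration that assumes the start offsets are already available next to each token, which is exactly the plumbing this ticket says is currently missing; `group_by_start_offset` is a hypothetical name.

```python
from collections import defaultdict


def group_by_start_offset(per_field_tokens):
    """Build one clause group per distinct start offset across all fields.

    per_field_tokens maps a field name to a list of (token_text, start_offset).
    """
    groups = defaultdict(list)
    for field, tokens in per_field_tokens.items():
        for text, start in tokens:
            groups[start].append(f"{field}:{text}")
    # Order the groups by offset so they follow the order of the input terms.
    return [groups[offset] for offset in sorted(groups)]


per_field_tokens = {
    "f1": [("foo", 0), ("bar", 4)],
    "f2": [("f", 0), ("fo", 0), ("foo", 0), ("b", 4), ("ba", 4), ("bar", 4)],
}
for group in group_by_start_offset(per_field_tokens):
    print("(" + " ".join(group) + ")")
# (f1:foo f2:f f2:fo f2:foo)
# (f1:bar f2:b f2:ba f2:bar)
```

Each resulting group would become one dismax clause, so {{mm}} counts input terms regardless of how many tokens each analysis chain emits.
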
> As mentioned, there would be significant plumbing needed to make startOffsets
> available to edismax in the code where the per-field queries are converted
> into per-term queries. Modifications would possibly be needed in both the
> Solr and Lucene repos. This ticket is logged in hopes of gathering feedback
> about whether this is a worthwhile/viable approach to pursue further.
>
> Related tickets:
> https://issues.apache.org/jira/browse/SOLR-12779
> https://issues.apache.org/jira/browse/SOLR-15407
>
> Related blog entries:
> [https://opensourceconnections.com/blog/2018/02/20/edismax-and-multiterm-synonyms-oddities]
> [https://sease.io/2021/05/apache-solr-sow-parameter-split-on-whitespace-and-multi-field-full-text-search.html]
>