[
https://issues.apache.org/jira/browse/SOLR-9185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15929124#comment-15929124
]
Steve Rowe edited comment on SOLR-9185 at 3/16/17 11:30 PM:
------------------------------------------------------------
Patch addressing the remaining issues. Precommit and all Solr tests pass. I
plan on committing this shortly so that it will make the 6.5 release.
Both edismax and the standard query parser are covered. I did not add this
feature to the dismax parser (or to any other Solr query parsers); if people
want this feature added elsewhere, we can do that under another issue.
Some implementation notes:
* As noted in previous comments on this issue, the feature is activated by
specifying request param {{sow=false}}. By default, {{sow=true}}; there is no
behavior change at all if the {{sow}} param is not specified.
* I ran {{TestSolrQueryParser.testParsingPerformance()}} under three
conditions: a) unpatched; b) patched using the default behavior (same as
{{sow=true}}); and c) patched with {{sow=false}} to activate the
don't-split-on-whitespace code. Best-of-ten results, run in a bash loop on
my Linux box, show all three within about 0.5% of each other's QPS (likely
noise): between 91K and 92K QPS. Average-of-ten puts the two patched
conditions roughly 2% slower than unpatched (88K vs. 90K QPS). I think this
is acceptable.
* When per-field query structures differ, e.g. when one field's analyzer
removes stopwords and another's doesn't, edismax's DisjunctionMaxQuery
structure when {{sow=false}} differs from that produced when {{sow=true}}.
Briefly, {{sow=true}} produces a boolean query containing one dismax query per
query term, while {{sow=false}} produces a dismax query containing one boolean
query per field. Min-should-match processing does (what I think is) the right
thing here. See
{{TestExtendedDismaxParser.testSplitOnWhitespace_Different_Field_Analysis()}}
for some examples of this. *Note*: when {{sow=false}} and all queried fields'
query structures are the same, edismax does what it has always done: produce a
boolean query containing one dismax query per term.
* There is a new test suite {{TestMultiWordSynonyms}} that shows multi-term
source synonyms matching at query-time.
* In order to deal with the set query changes introduced by SOLR-9786, I
extended {{SolrQueryParserBase.RawQuery}} to hold an array of terms, to enable
their later consumption as either a concatenated string (for tokenized fields)
or individually (for non-tokenized fields).
* As noted on LUCENE-7533 for Lucene's classic query parser (and equally
applicable to the Solr standard and edismax query parsers),
{{split-on-whitespace=false}} and {{autoGeneratePhraseQueries=true}} don't play
well together at present. I've introduced a new exception
{{QueryParserConfigurationException}} that will be thrown if any queried field
is configured with {{autoGeneratePhraseQueries=true}} when the {{sow=false}}
request param is specified. For edismax, this is a departure: it's supposed to
never throw exceptions. I think that's okay for now though, since this is an
optional/experimental feature. Maybe when {{sow=false}} becomes the default
(later, under another issue - see below), edismax should just log a warning and
produce a query that excludes fields with this problem?
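To make the structural difference described in the notes above concrete, here is a toy sketch of the two query shapes. It is not the edismax code: the class {{SowStructureSketch}}, the field names {{title}} and {{text}}, the stopword list, and the string rendering of queries are all assumptions made for illustration. It also does not model the case where all fields analyze identically.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SowStructureSketch {
    // Hypothetical stopword list used only by the toy "title" analyzer.
    static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList("the", "a", "of"));

    // Toy per-field analysis: "title" drops stopwords, every other field keeps all terms.
    static List<String> analyze(String field, List<String> terms) {
        List<String> out = new ArrayList<>();
        for (String t : terms) {
            if (!(field.equals("title") && STOPWORDS.contains(t))) {
                out.add(t);
            }
        }
        return out;
    }

    // sow=true shape: a boolean query holding one dismax group per whitespace-separated term.
    static String splitOnWhitespace(List<String> fields, String query) {
        List<String> clauses = new ArrayList<>();
        for (String term : query.trim().split("\\s+")) {
            List<String> disjuncts = new ArrayList<>();
            for (String f : fields) {
                for (String tok : analyze(f, Arrays.asList(term))) {
                    disjuncts.add(f + ":" + tok);
                }
            }
            if (!disjuncts.isEmpty()) {
                clauses.add("(" + String.join(" | ", disjuncts) + ")");
            }
        }
        return String.join(" ", clauses);
    }

    // sow=false shape: a single dismax group holding one boolean query per field.
    // Splitting on whitespace here only stands in for each field's tokenizer.
    static String noSplit(List<String> fields, String query) {
        List<String> terms = Arrays.asList(query.trim().split("\\s+"));
        List<String> disjuncts = new ArrayList<>();
        for (String f : fields) {
            List<String> perField = new ArrayList<>();
            for (String tok : analyze(f, terms)) {
                perField.add(f + ":" + tok);
            }
            if (!perField.isEmpty()) {
                disjuncts.add("(" + String.join(" ", perField) + ")");
            }
        }
        return "(" + String.join(" | ", disjuncts) + ")";
    }

    public static void main(String[] args) {
        List<String> fields = Arrays.asList("title", "text");
        System.out.println(splitOnWhitespace(fields, "the cat"));
        System.out.println(noSplit(fields, "the cat"));
    }
}
```

In this toy model the query {{the cat}} comes out as {{(text:the) (title:cat | text:cat)}} in the split case and {{((title:cat) | (text:the text:cat))}} in the no-split case, which mirrors the "one dismax per term" versus "one boolean query per field" distinction.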
After this has been committed, I'll make a new issue to switch the default
behavior on 7.0/master to {{sow=false}}.
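For anyone who wants to try the feature, a request only needs the extra {{sow=false}} param. Below is a minimal sketch that builds such a request URL using only the JDK. The base URL, core name, and {{qf}} field names are placeholders I made up; {{defType}}, {{q}}, {{qf}} and {{sow}} are the actual request params.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

public class SowRequestSketch {
    // Builds a /select URL for an edismax query with the sow param set explicitly.
    static String buildSelectUrl(String baseUrl, String userQuery, boolean sow) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("defType", "edismax");
        params.put("q", userQuery);
        params.put("qf", "title text");          // hypothetical field names
        params.put("sow", String.valueOf(sow));  // false activates the new behavior

        StringBuilder sb = new StringBuilder(baseUrl).append("/select?");
        boolean first = true;
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (!first) {
                sb.append('&');
            }
            sb.append(encode(e.getKey())).append('=').append(encode(e.getValue()));
            first = false;
        }
        return sb.toString();
    }

    // URL-encodes one key or value; UTF-8 is always available, so the checked
    // exception is rethrown unchecked.
    private static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException ex) {
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        // Placeholder host and core name.
        System.out.println(buildSelectUrl("http://localhost:8983/solr/mycore", "new york", false));
    }
}
```

Leaving the {{sow}} param off entirely gives the existing behavior, since the default is {{sow=true}}.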
> Solr's edismax and "Lucene"/standard query parsers should not split on
> whitespace before sending terms to analysis
> ------------------------------------------------------------------------------------------------------------------
>
> Key: SOLR-9185
> URL: https://issues.apache.org/jira/browse/SOLR-9185
> Project: Solr
> Issue Type: Bug
> Reporter: Steve Rowe
> Assignee: Steve Rowe
> Attachments: SOLR-9185.patch, SOLR-9185.patch, SOLR-9185.patch,
> SOLR-9185.patch
>
>
> Copied from LUCENE-2605:
> The queryparser parses input on whitespace, and sends each whitespace
> separated term to its own independent token stream.
> This breaks the following at query-time, because they can't see across
> whitespace boundaries:
> * n-gram analysis
> * shingles
> * synonyms (especially multi-word for whitespace-separated languages)
> * languages where a 'word' can contain whitespace (e.g. Vietnamese)
> It's also rather unexpected, as users think their
> charfilters/tokenizers/tokenfilters will do the same thing at index and
> query time, but in many cases they can't. Instead, preferably the
> queryparser would parse around only real 'operators'.
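The multi-word synonym problem in the description can be illustrated with a toy sketch. This is not Lucene or Solr code: the class {{WhitespaceSplitSketch}}, the single synonym rule ({{new york}} also yields {{ny}}), and the simplified "analysis" are assumptions chosen only to show why analysis cannot see across whitespace once the parser has split the input.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

public class WhitespaceSplitSketch {
    // Hypothetical synonym rule with a multi-word source.
    static final Map<String, String> SYNONYMS = Collections.singletonMap("new york", "ny");

    // Toy analysis chain: emits the whitespace tokens of the text it is given, plus
    // the synonym for any rule whose full source phrase appears in that text.
    static List<String> analyze(String text) {
        String trimmed = text.trim();
        List<String> out = new ArrayList<>(Arrays.asList(trimmed.split("\\s+")));
        for (Map.Entry<String, String> rule : SYNONYMS.entrySet()) {
            if ((" " + trimmed + " ").contains(" " + rule.getKey() + " ")) {
                out.add(rule.getValue());
            }
        }
        return out;
    }

    // Parser that splits on whitespace first: each chunk is analyzed on its own,
    // so the two-word rule never sees both words together.
    static List<String> parseSplitting(String query) {
        List<String> out = new ArrayList<>();
        for (String chunk : query.trim().split("\\s+")) {
            out.addAll(analyze(chunk));
        }
        return out;
    }

    // Parser that hands the whole run of text to analysis in one piece.
    static List<String> parseWhole(String query) {
        return analyze(query);
    }

    public static void main(String[] args) {
        System.out.println(parseSplitting("new york pizza"));
        System.out.println(parseWhole("new york pizza"));
    }
}
```

With the splitting parser the synonym {{ny}} never appears in the output, while the whole-text parser adds it after the original tokens.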