[jira] [Comment Edited] (SOLR-9185) Solr's edismax and "Lucene"/standard query parsers should not split on whitespace before sending terms to analysis

Steve Rowe (JIRA) Thu, 16 Mar 2017 16:32:29 -0700

    [ https://issues.apache.org/jira/browse/SOLR-9185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15929124#comment-15929124 ]


Steve Rowe edited comment on SOLR-9185 at 3/16/17 11:30 PM:
------------------------------------------------------------

Patch addressing the remaining issues.  Precommit and all Solr tests pass.  I 
plan on committing this shortly so that it will make the 6.5 release.

Both edismax and the standard query parser are covered.  I did not add this 
feature to the dismax parser (or to any other Solr query parsers); if people 
want this feature added elsewhere, we can do that under another issue.

Some implementation notes:

* As noted in previous comments on this issue, the feature is activated by 
specifying request param {{sow=false}}.  By default, {{sow=true}}; there is no 
behavior change at all if the {{sow}} param is not specified.
* I ran {{TestSolrQueryParser.testParsingPerformance()}} under three 
conditions: a) unpatched; b) patched using the default behavior (same as 
{{sow=true}}); and c) patched with {{sow=false}} to activate the 
don't-split-on-whitespace code.  The best-of-ten results run in a bash loop on 
my Linux box show all three within about 0.5% of each other's QPS (likely 
noise): between 91K and 92K QPS.  Average-of-ten puts the two patched 
conditions at roughly 2% slower (88K vs. 90K QPS).  I think this is acceptable.
* When per-field query structures differ, e.g. when one field's analyzer 
removes stopwords and another's doesn't, edismax's DisjunctionMaxQuery 
structure when {{sow=false}} differs from that produced when {{sow=true}}.  
Briefly, {{sow=true}} produces a boolean query containing one dismax query per 
query term, while {{sow=false}} produces a dismax query containing one boolean 
query per field. Min-should-match processing does (what I think is) the right 
thing here. See 
{{TestExtendedDismaxParser.testSplitOnWhitespace_Different_Field_Analysis()}} 
for some examples of this. *Note*: when {{sow=false}} and all queried fields' 
query structure is the same, edismax does what it has always done: produce a 
boolean query containing one dismax query per term.
* There is a new test suite {{TestMultiWordSynonyms}} that shows multi-term 
source synonyms matching at query-time.
* In order to deal with the set query changes introduced by SOLR-9786, I 
extended {{SolrQueryParserBase.RawQuery}} to hold an array of terms, to enable 
their later consumption as either a concatenated string (for tokenized fields) 
or individually (for non-tokenized fields).
* As noted on LUCENE-7533 for Lucene's classic query parser (and equally 
applicable to the Solr standard and edismax query parsers), 
{{split-on-whitespace=false}} and {{autoGeneratePhraseQueries=true}} don't play 
well together at present.  I've introduced a new exception 
{{QueryParserConfigurationException}} that will be thrown if any queried field 
is configured with {{autoGeneratePhraseQueries=true}} when the {{sow=false}} 
request param is specified.  For edismax, this is a departure: it's supposed to 
never throw exceptions.  I think that's okay for now though, since this is an 
optional/experimental feature.  Maybe when {{sow=false}} becomes the default 
(later, under another issue - see below), edismax should just log a warning and 
produce a query that excludes fields with this problem?
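
The structural difference described in the third note above can be pictured with a toy model. This is only a sketch: plain tuples stand in for Lucene's BooleanQuery and DisjunctionMaxQuery, and the field names, stopword list, and helper functions are hypothetical, not taken from the patch.

```python
# Toy model only: tuples stand in for BooleanQuery ("bool") and
# DisjunctionMaxQuery ("dismax"). Nothing here is Solr code.
STOPWORDS = {"the"}  # hypothetical stopword list

# Hypothetical per-field analyzers: "title" drops stopwords, "body" keeps them.
ANALYZERS = {
    "title": lambda text: [t for t in text.lower().split() if t not in STOPWORDS],
    "body": lambda text: text.lower().split(),
}


def term_centric(query, fields):
    """Shape described for sow=true: a boolean query of one dismax per term."""
    clauses = []
    for word in query.split():  # split on whitespace before analysis
        disjuncts = [(f, tok) for f in fields for tok in ANALYZERS[f](word)]
        if disjuncts:
            clauses.append(("dismax", disjuncts))
    return ("bool", clauses)


def field_centric(query, fields):
    """Shape described for sow=false when per-field structures differ:
    a dismax query of one boolean query per field."""
    per_field = []
    for f in fields:
        toks = ANALYZERS[f](query)  # analyze the whole string at once
        per_field.append(("bool", [(f, t) for t in toks]))
    return ("dismax", per_field)


fields = ["title", "body"]
print(term_centric("the cat", fields))
print(field_centric("the cat", fields))
```

In this model the stopword "the" survives only in the hypothetical "body" field, so the two shapes group the same four-or-fewer clauses differently. How min-should-match is applied to each shape is not modeled here.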

After this has been committed, I'll make a new issue to switch the default 
behavior on 7.0/master to {{sow=false}}.
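
Until the default changes, a client has to ask for the new behavior explicitly. A minimal sketch of building such a request URL follows; the host, port, collection name, and the build_select_url helper are assumptions made for illustration, while q, defType, qf, and sow are the request parameters discussed on this issue.

```python
from urllib.parse import urlencode


def build_select_url(base_url, query, fields, split_on_whitespace=None):
    """Build an edismax select URL; the sow param is sent only when given."""
    params = {"q": query, "defType": "edismax", "qf": " ".join(fields)}
    if split_on_whitespace is not None:
        # Omitting sow leaves the server default in effect.
        params["sow"] = "true" if split_on_whitespace else "false"
    return base_url + "?" + urlencode(params)


# Hypothetical local Solr endpoint and collection name.
BASE = "http://localhost:8983/solr/mycollection/select"

url = build_select_url(BASE, "wi fi", ["title", "body"], split_on_whitespace=False)
print(url)
```

Leaving split_on_whitespace unset keeps the request identical to one sent before the patch, which matches the note that nothing changes when sow is not specified.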


> Solr's edismax and "Lucene"/standard query parsers should not split on 
> whitespace before sending terms to analysis
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-9185
>                 URL: https://issues.apache.org/jira/browse/SOLR-9185
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Steve Rowe
>            Assignee: Steve Rowe
>         Attachments: SOLR-9185.patch, SOLR-9185.patch, SOLR-9185.patch, 
> SOLR-9185.patch
>
>
> Copied from LUCENE-2605:
> The queryparser parses input on whitespace, and sends each 
> whitespace-separated term to its own independent token stream.
> This breaks the following at query-time, because they can't see across 
> whitespace boundaries:
> n-gram analysis
> shingles
> synonyms (especially multi-word for whitespace-separated languages)
> languages where a 'word' can contain whitespace (e.g. Vietnamese)
> It's also rather unexpected, as users think their 
> charfilters/tokenizers/tokenfilters will do the same thing at index and 
> query time, but in many cases they can't. Instead, preferably the 
> queryparser would parse around only real 'operators'.
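
The problem described above can be illustrated with a toy analyzer. This is a sketch under stated assumptions, not Solr or Lucene code: the synonym rule, the analyze function, and both parse functions are hypothetical stand-ins for an analysis chain with a multi-word synonym filter.

```python
# Hypothetical multi-word synonym rule: the two tokens "wi fi" map to "wifi".
SYNONYMS = {("wi", "fi"): "wifi"}


def analyze(text):
    """Toy analysis chain: lowercase, tokenize, then apply two-token synonyms."""
    tokens = text.lower().split()
    out = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in SYNONYMS:
            out.append(SYNONYMS[pair])  # the rule needs to see both tokens
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def parse_split_on_whitespace(query):
    """Each whitespace-separated word gets its own independent analysis."""
    out = []
    for word in query.split():
        out.extend(analyze(word))
    return out


def parse_whole(query):
    """The whole query string is handed to analysis in one piece."""
    return analyze(query)


print(parse_split_on_whitespace("wi fi router"))  # ['wi', 'fi', 'router']
print(parse_whole("wi fi router"))                # ['wifi', 'router']
```

When each word is analyzed alone, the synonym rule never sees "wi" and "fi" together, so it cannot fire. The same limitation applies to anything else that needs to look across whitespace, such as shingles.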



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
