[jira] [Updated] (SOLR-10310) By default, stop splitting on whitespace prior to analysis in edismax and "Lucene"/standard query parsers

Steve Rowe (JIRA) Fri, 21 Apr 2017 16:29:17 -0700

     [ 
https://issues.apache.org/jira/browse/SOLR-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Steve Rowe updated SOLR-10310:
------------------------------
    Attachment: SOLR-10310.patch

Patch switching default {{sow}} to {{false}}.

All Solr tests pass, and precommit passes.

I think it's ready to go, but I'll wait a few days before committing in case 
there are objections.

Two behavior changes result from this switch, as illustrated by tests:

1. When {{sow=false}}, {{autoGeneratePhraseQueries="true"}}, and words are 
split (e.g. by WordDelimiterGraphFilter) but no overlapping terms are produced, 
phrase queries are *not* produced - see LUCENE-7799 for a possible eventual 
solution to this problem:

{code:java|title=TestSolrQueryParser.testPhrase()}
// The "text" field's type has WordDelimiterGraphFilterFactory (WDGFF) and autoGeneratePhraseQueries=true,
// so this should generate a phrase query for "now cow" and match only one doc
assertQ(req("q", "text:now-cow", "indent", "true", "sow","true")
    , "//*[@numFound='1']"
);
// When sow=false, autoGeneratePhraseQueries=true only works when a graph is produced
// (i.e. overlapping terms, e.g. if WDGFF's preserveOriginal=1 or concatenateWords=1).
// The WDGFF config on the "text" field doesn't produce a graph, so the generated query
// is not a phrase query.  As a result, docs can match that don't match phrase query "now cow"
assertQ(req("q", "text:now-cow", "indent", "true", "sow","false")
    , "//*[@numFound='2']"
);
assertQ(req("q", "text:now-cow", "indent", "true") // default sow=false
    , "//*[@numFound='2']"
);
{code}
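
To make the first behavior change concrete, here is a toy model of the difference between a phrase match and a non-phrase match over the terms {{now}} and {{cow}}. This is only a sketch: it is not Lucene or Solr code, the class {{PhraseSketch}}, its methods and the two documents are hypothetical, and the non-phrase case is simplified to "every term occurs somewhere" (whether the real generated clauses are required or optional is not modeled here).

```java
import java.util.Arrays;
import java.util.List;

// Toy model only: not Lucene/Solr code, and no real analysis is performed.
final class PhraseSketch {

    // Phrase match: the query terms must appear adjacent and in order.
    static boolean matchesPhrase(List<String> docTokens, List<String> queryTerms) {
        for (int start = 0; start + queryTerms.size() <= docTokens.size(); start++) {
            if (docTokens.subList(start, start + queryTerms.size()).equals(queryTerms)) {
                return true;
            }
        }
        return false;
    }

    // Non-phrase match (simplified assumption): every term occurs somewhere,
    // positions are ignored.
    static boolean matchesTerms(List<String> docTokens, List<String> queryTerms) {
        return docTokens.containsAll(queryTerms);
    }

    public static void main(String[] args) {
        // "now-cow" after the word delimiter filter has split it into two terms.
        List<String> query = Arrays.asList("now", "cow");

        // Hypothetical documents, already tokenized.
        List<String> docA = Arrays.asList("now", "cow");
        List<String> docB = Arrays.asList("now", "brown", "cow");

        System.out.println("phrase, docA: " + matchesPhrase(docA, query));
        System.out.println("phrase, docB: " + matchesPhrase(docB, query));
        System.out.println("terms,  docA: " + matchesTerms(docA, query));
        System.out.println("terms,  docB: " + matchesTerms(docB, query));
    }
}
```

In this model only {{docA}} satisfies the phrase, while both documents satisfy the non-phrase query, which mirrors the jump from one hit to two in the test above.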

2. {{sow=false}} changes the queries edismax produces over multiple fields when 
any of the fields’ query-time analysis differs from the other fields’, e.g. if 
one field’s analyzer removes stopwords when another field’s doesn’t. In this 
case, rather than a dismax-query-per-whitespace-separated-term (edismax’s 
behavior when {{sow=true}}), a dismax-query-per-field is produced. This can 
change results in general, but quite significantly when combined with the 
{{mm}} (min-should-match) request parameter: since min-should-match applies per 
field instead of per term, missing terms in one field’s analysis won’t 
disqualify docs from matching.

{code:java|title=TestExtendedDismaxParser.testFocusQueryParser()}
assertQ(req("defType","edismax", "mm","100%", "q","Terminator: 100", "qf","movies_t foo_i", "sow","true"),
        nor);
// When sow=false, the per-field query structures differ (no "Terminator" query on integer field foo_i),
// so a dismax-per-field is constructed.  As a result, mm=100% is applied per-field instead of per-term;
// since there is only one term (100) required in the foo_i field's dismax, the query can match docs that
// only have the 100 term in the foo_i field, and don't necessarily have "Terminator" in any field.
assertQ(req("defType","edismax", "mm","100%", "q","Terminator: 100", "qf","movies_t foo_i", "sow","false"),
        oner);
assertQ(req("defType","edismax", "mm","100%", "q","Terminator: 100", "qf","movies_t foo_i"), // default sow=false
        oner);
{code}
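
The per-term versus per-field application of {{mm=100%}} described above can be modeled outside Solr. The following is a rough sketch under stated assumptions, not edismax's actual implementation: the class {{MmSketch}}, its method names, the lowercased terms and the sample document are all hypothetical, and the only thing taken from the test is that the integer field {{foo_i}} produces no clause for "Terminator".

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model only: not Solr code. Maps are keyed by field name.
final class MmSketch {

    // sow=true style: one dismax per whitespace-separated term. With mm=100%,
    // every term must match in at least one field that produced a clause for it.
    static boolean matchesPerTerm(List<String> queryTerms,
                                  Map<String, List<String>> analyzedTermsByField,
                                  Map<String, Set<String>> doc) {
        for (String term : queryTerms) {
            boolean matchedInSomeField = false;
            for (Map.Entry<String, List<String>> field : analyzedTermsByField.entrySet()) {
                Set<String> docTokens =
                        doc.getOrDefault(field.getKey(), Collections.<String>emptySet());
                if (field.getValue().contains(term) && docTokens.contains(term)) {
                    matchedInSomeField = true;
                    break;
                }
            }
            if (!matchedInSomeField) {
                return false;
            }
        }
        return true;
    }

    // sow=false style when per-field analysis differs: one query per field.
    // With mm=100% applied per field, a doc matches if any single field
    // contains all of the terms that this field's analysis produced.
    static boolean matchesPerField(Map<String, List<String>> analyzedTermsByField,
                                   Map<String, Set<String>> doc) {
        for (Map.Entry<String, List<String>> field : analyzedTermsByField.entrySet()) {
            Set<String> docTokens =
                    doc.getOrDefault(field.getKey(), Collections.<String>emptySet());
            if (!field.getValue().isEmpty() && docTokens.containsAll(field.getValue())) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> queryTerms = Arrays.asList("terminator", "100");

        // Assumed analysis results: the integer field has no clause for "terminator".
        Map<String, List<String>> analyzed = new LinkedHashMap<>();
        analyzed.put("movies_t", Arrays.asList("terminator", "100"));
        analyzed.put("foo_i", Arrays.asList("100"));

        // Hypothetical doc: 100 in foo_i, no "terminator" anywhere.
        Map<String, Set<String>> doc = new HashMap<>();
        doc.put("movies_t", new HashSet<>(Arrays.asList("dummy")));
        doc.put("foo_i", new HashSet<>(Arrays.asList("100")));

        System.out.println("per-term:  " + matchesPerTerm(queryTerms, analyzed, doc));
        System.out.println("per-field: " + matchesPerField(analyzed, doc));
    }
}
```

For the hypothetical document, the per-term model rejects it because "terminator" matches in no field, while the per-field model accepts it because {{foo_i}} requires only {{100}}. Users who depend on the per-term behavior can still pass {{sow=true}} explicitly, as the first assertion in the test does.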

> By default, stop splitting on whitespace prior to analysis in edismax and 
> "Lucene"/standard query parsers
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-10310
>                 URL: https://issues.apache.org/jira/browse/SOLR-10310
>             Project: Solr
>          Issue Type: Task
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Steve Rowe
>         Attachments: SOLR-10310.patch
>
>
> SOLR-9185 introduced an option on the edismax and standard query parsers to 
> not perform pre-analysis whitespace splitting: the {{sow=false}} request 
> param.
> On master/7.0, we should make {{sow=false}} the default.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

