[
https://issues.apache.org/jira/browse/SOLR-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Steve Rowe updated SOLR-10310:
------------------------------
Attachment: SOLR-10310.patch
Patch switching default {{sow}} to {{false}}.
All Solr tests pass, and precommit passes.
I think it's ready to go, but I'll wait a few days before committing in case
there are objections.
Two behavior changes result from this switch, as illustrated by tests:
1. When {{sow=false}} and {{autoGeneratePhraseQueries="true"}}, and analysis
splits words (e.g. via WordDelimiterGraphFilter) without producing overlapping
terms, phrase queries are *not* produced. See LUCENE-7799 for a possible
eventual solution to this problem:
{code:java|title=TestSolrQueryParser.testPhrase()}
// "text" field's type has WordDelimiterGraphFilter (WDGFF) and
// autoGeneratePhraseQueries=true
// should generate a phrase of "now cow" and match only one doc
assertQ(req("q", "text:now-cow", "indent", "true", "sow","true")
    , "//*[@numFound='1']"
);
// When sow=false, autoGeneratePhraseQueries=true only works when a graph is
// produced (i.e. overlapping terms, e.g. if WDGFF's preserveOriginal=1 or
// concatenateWords=1).
// The WDGFF config on the "text" field doesn't produce a graph, so the
// generated query is not a phrase query. As a result, docs can match that
// don't match phrase query "now cow"
assertQ(req("q", "text:now-cow", "indent", "true", "sow","false")
    , "//*[@numFound='2']"
);
assertQ(req("q", "text:now-cow", "indent", "true") // default sow=false
    , "//*[@numFound='2']"
);
{code}
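To make behavior change 1 concrete, here is a toy model in plain Java. It is
only a sketch and does not use Lucene or Solr: {{analyze}}, {{matchesPhrase}}
and {{matchesTerms}} are hypothetical helpers, the tokenizer is a crude
stand-in for a WDGFF config that emits no overlapping terms, and whether the
non-phrase clauses are combined with OR or AND is an assumption left as a flag.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

public class PhraseVsTermsSketch {

    // Hypothetical stand-in for analysis: lowercase, then split on anything
    // that is not a letter or digit, so "now-cow" becomes [now, cow].
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Phrase semantics: the query tokens must appear consecutively, in order.
    static boolean matchesPhrase(List<String> doc, List<String> query) {
        return !query.isEmpty() && Collections.indexOfSubList(doc, query) >= 0;
    }

    // Non-phrase semantics: every token is an independent clause.
    // requireAll=false models OR, requireAll=true models AND (an assumption;
    // the real operator depends on the request and parser settings).
    static boolean matchesTerms(List<String> doc, List<String> query, boolean requireAll) {
        if (query.isEmpty()) {
            return false;
        }
        return requireAll ? doc.containsAll(query) : query.stream().anyMatch(doc::contains);
    }

    public static void main(String[] args) {
        List<String> query = analyze("now-cow");
        for (String text : new String[] {"now cow", "now brown cow"}) {
            List<String> doc = analyze(text);
            System.out.println(text
                + " | phrase=" + matchesPhrase(doc, query)
                + " | terms=" + matchesTerms(doc, query, false));
        }
    }
}
```

In this model a document such as "now brown cow" fails the phrase check but
passes the term-based check under either operator, which is the kind of extra
match the {{numFound='2'}} assertions above reflect.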
2. {{sow=false}} changes the queries edismax produces over multiple fields when
any of the fields’ query-time analysis differs from the other fields’, e.g. if
one field’s analyzer removes stopwords when another field’s doesn’t. In this
case, rather than a dismax-query-per-whitespace-separated-term (edismax’s
behavior when {{sow=true}}), a dismax-query-per-field is produced. This can
change results in general, but quite significantly when combined with the
{{mm}} (min-should-match) request parameter: since min-should-match applies per
field instead of per term, missing terms in one field’s analysis won’t
disqualify docs from matching.
{code:java|title=TestExtendedDismaxParser.testFocusQueryParser()}
assertQ(req("defType","edismax", "mm","100%", "q","Terminator: 100",
    "qf","movies_t foo_i", "sow","true"),
    nor);
// When sow=false, the per-field query structures differ (no "Terminator"
// query on integer field foo_i), so a dismax-per-field is constructed.
// As a result, mm=100% is applied per-field instead of per-term; since there
// is only one term (100) required in the foo_i field's dismax, the query can
// match docs that only have the 100 term in the foo_i field, and don't
// necessarily have "Terminator" in any field.
assertQ(req("defType","edismax", "mm","100%", "q","Terminator: 100",
    "qf","movies_t foo_i", "sow","false"),
    oner);
assertQ(req("defType","edismax", "mm","100%", "q","Terminator: 100",
    "qf","movies_t foo_i"), // default sow=false
    oner);
{code}
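The per-term versus per-field application of {{mm}} in behavior change 2 can
be sketched with another toy model in plain Java. None of this is edismax
code: {{analyze}}, {{doc}}, {{matchesPerTerm}} and {{matchesPerField}} are
hypothetical helpers, {{mm}} is fixed at 100%, and the "integer field drops
non-numeric terms" rule is a simplifying assumption about how the per-field
query structures end up differing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class MinShouldMatchSketch {

    // Hypothetical per-field analysis. Fields ending in "_i" keep only purely
    // numeric terms; other fields lowercase and strip punctuation.
    // Returns null when the field produces no clause for this term.
    static String analyze(String field, String raw) {
        if (field.endsWith("_i")) {
            return raw.matches("\\d+") ? raw : null;
        }
        String cleaned = raw.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]", "");
        return cleaned.isEmpty() ? null : cleaned;
    }

    // Hypothetical document builder: indexed terms for movies_t and foo_i.
    static Map<String, Set<String>> doc(String moviesTerms, String fooValue) {
        Map<String, Set<String>> d = new HashMap<>();
        d.put("movies_t", new HashSet<>(Arrays.asList(moviesTerms.split(" "))));
        d.put("foo_i", Collections.singleton(fooValue));
        return d;
    }

    // sow=true model: one dismax per whitespace-separated term; with mm=100%
    // every term must match in at least one field.
    static boolean matchesPerTerm(Map<String, Set<String>> doc, String q, List<String> fields) {
        for (String raw : q.trim().split("\\s+")) {
            boolean matchedSomewhere = false;
            for (String f : fields) {
                String t = analyze(f, raw);
                if (t != null && doc.getOrDefault(f, Collections.emptySet()).contains(t)) {
                    matchedSomewhere = true;
                    break;
                }
            }
            if (!matchedSomewhere) {
                return false;
            }
        }
        return true;
    }

    // sow=false model: one query per field; mm=100% only covers the terms that
    // survived that field's analysis, and any single field may satisfy the query.
    static boolean matchesPerField(Map<String, Set<String>> doc, String q, List<String> fields) {
        for (String f : fields) {
            List<String> terms = new ArrayList<>();
            for (String raw : q.trim().split("\\s+")) {
                String t = analyze(f, raw);
                if (t != null) {
                    terms.add(t);
                }
            }
            if (!terms.isEmpty() && doc.getOrDefault(f, Collections.emptySet()).containsAll(terms)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> qf = Arrays.asList("movies_t", "foo_i");
        Map<String, Set<String>> d = doc("rocky", "100"); // no "terminator" anywhere
        System.out.println("per-term:  " + matchesPerTerm(d, "Terminator: 100", qf));
        System.out.println("per-field: " + matchesPerField(d, "Terminator: 100", qf));
    }
}
```

In this model, a document whose only hit is {{100}} in {{foo_i}} is rejected by
the per-term check and accepted by the per-field check, mirroring the
{{nor}}/{{oner}} difference in the test above.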
> By default, stop splitting on whitespace prior to analysis in edismax and
> "Lucene"/standard query parsers
> ---------------------------------------------------------------------------------------------------------
>
> Key: SOLR-10310
> URL: https://issues.apache.org/jira/browse/SOLR-10310
> Project: Solr
> Issue Type: Task
> Security Level: Public(Default Security Level. Issues are Public)
> Reporter: Steve Rowe
> Attachments: SOLR-10310.patch
>
>
> SOLR-9185 introduced an option on the edismax and standard query parsers to
> not perform pre-analysis whitespace splitting: the {{sow=false}} request
> param.
> On master/7.0, we should make {{sow=false}} the default.