[
https://issues.apache.org/jira/browse/SOLR-15407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347410#comment-17347410
]
Alessandro Benedetti edited comment on SOLR-15407 at 5/19/21, 8:25 AM:
-----------------------------------------------------------------------
Hi David, first of all thanks for taking the time to think about this, it is
much appreciated.
In regards to:
{quote}sow=false implies the minimum should match is "per field"{quote}
I initially thought the same as you (i.e. sow should not affect mm, and mm
should always be "per document").
Then I spent some time investigating for a dedicated advanced blog post (coming
out in the next few days) and verified that in 8.8.2 this is currently not the
case.
Now, I don't know if it's on purpose or not, but if you have multi-field
search, with different analysis per field, this is what you get (I post here a
piece of the upcoming blog):
When the parsed query moves from being term-centric (sow=true) to
field-centric (sow=false and different text analysis per field), mm means two
different things:
minimum number of query terms matched, regardless of which field they match in
(PER DOCUMENT)
{code:java}
sow = true
mm=2
qf = author subjects_as_same_term
q = united kingdom
defType = edismax
"parsedquery_toString":
"+(((author:united | subjects_as_same_term:united) (author:kingdom |
subjects_as_same_term:kingdom))~2)"
"response":{"numFound":2,"start":0,"maxScore":7.757958,"numFoundExact":true,"docs":[
{
"id":"888888",
"author":"united",
"subjects":["kingdom"],
"score":7.757958},
{
"id":"77777",
"author":"united kingdom",
"score":5.874222}]
},
{code}
minimum number of query terms matched within the same field (i.e. all the
required query terms must be found in a single field)
(PER FIELD)
{code:java}
sow = false
mm=2
qf = author subjects_as_same_term
q = united kingdom
defType = edismax
"parsedquery_toString":
"+(((author:united author:kingdom)~2) |
(((subjects_as_same_term:uk subjects_as_same_term:"united kingdom"
subjects_as_same_term:england subjects_as_same_term:london
subjects_as_same_term:british subjects_as_same_term:britain))~1))"
{code}
Here (author:united author:kingdom)~2 means both clauses must match for a
document to be a candidate. It is in disjunction with
(subjects_as_same_term:uk subjects_as_same_term:"united kingdom"
subjects_as_same_term:england subjects_as_same_term:london
subjects_as_same_term:british subjects_as_same_term:britain)~1, which means at
least one clause must match (because synonym expansion turned the two original
terms into a single one).
{code:java}
"response":{"numFound":1,"start":0,"maxScore":5.874222,"numFoundExact":true,"docs":[
{
"id":"77777",
"author":"united kingdom",
"score":5.874222}]
}
{code}
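To make the difference concrete, here is a minimal, self-contained sketch of the two interpretations of mm applied to the two documents above. It is only a toy model, not Solr's implementation: the class and method names are hypothetical, each field is reduced to a plain set of tokens, synonym expansion is ignored, and it assumes the stored "subjects" value is what ends up searchable in subjects_as_same_term.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model of the two mm interpretations; names are hypothetical, not Solr APIs.
class MinShouldMatchSketch {

    // Documents from the example above, reduced to field -> token set.
    // Assumption: "subjects" is searchable through subjects_as_same_term.
    static final Map<String, Set<String>> DOC_888888 = Map.of(
            "author", Set.of("united"),
            "subjects_as_same_term", Set.of("kingdom"));
    static final Map<String, Set<String>> DOC_77777 = Map.of(
            "author", Set.of("united", "kingdom"));

    static final List<String> TERMS = List.of("united", "kingdom");

    /** Term-centric (sow=true): count query terms found in any field. */
    static boolean matchesPerDocument(Map<String, Set<String>> doc,
                                      List<String> terms, int mm) {
        long matched = terms.stream()
                .filter(t -> doc.values().stream().anyMatch(tokens -> tokens.contains(t)))
                .count();
        return matched >= mm;
    }

    /** Field-centric (sow=false): one single field must hold at least mm terms. */
    static boolean matchesPerField(Map<String, Set<String>> doc,
                                   List<String> terms, int mm) {
        return doc.values().stream()
                .anyMatch(tokens -> terms.stream().filter(tokens::contains).count() >= mm);
    }

    public static void main(String[] args) {
        int mm = 2;
        System.out.println("per document, 888888: " + matchesPerDocument(DOC_888888, TERMS, mm));
        System.out.println("per document, 77777:  " + matchesPerDocument(DOC_77777, TERMS, mm));
        System.out.println("per field,    888888: " + matchesPerField(DOC_888888, TERMS, mm));
        System.out.println("per field,    77777:  " + matchesPerField(DOC_77777, TERMS, mm));
    }
}
```

In this simplified model both documents satisfy the per-document check, while only 77777 satisfies the per-field check, which is consistent with numFound going from 2 to 1 in the responses above.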
> eDismax sow=false doesn't work with string field types
> ------------------------------------------------------
>
> Key: SOLR-15407
> URL: https://issues.apache.org/jira/browse/SOLR-15407
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Components: query parsers
> Affects Versions: 8.8.2
> Reporter: Alessandro Benedetti
> Priority: Major
> Time Spent: 40m
> Remaining Estimate: 0h
>
> Currently, sow=false should not tokenize the user's query text and should
> delegate query-time text analysis to each field.
> But what happens if one of the fields involved is not analyzed?
> For example, because it is a string field type?
> Terms are split and the generated query is broken:
> {code:java}
> // Document whose string-typed field trait_ss holds the value "multi term"
> assertU(adoc("id", "75", "trait_ss", "multi term"));
>
> public void testSplitOnWhitespace_stringField_shouldBuildSingleClause() throws Exception {
>   // The whole input should match the string field as a single value
>   assertJQ(req("qf", "trait_ss", "defType", "edismax", "q", "multi term", "sow", "false"),
>       "/response/numFound==1", "/response/docs/[0]/id=='75'");
>
>   // The parsed query should contain one clause for the whole input
>   String parsedquery = getParsedQuery(
>       req("qf", "trait_ss", "q", "multi term", "defType", "edismax",
>           "sow", "false", "debugQuery", "true"));
>   assertThat(parsedquery, anyOf(containsString("((trait_ss:multi term))")));
> }
> {code}
> This test currently fails.
> The current parsed query is, wrongly:
> (trait_ss:multi trait_ss:term)
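
The expected versus actual behaviour described in the issue can be sketched as follows. This is only an illustration of the intent, not Solr's query parser: the class, the method and the analyzedField flag are hypothetical, and whitespace splitting stands in for real text analysis.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the clause building the issue asks for.
class SowFalseStringFieldSketch {

    /**
     * Builds the clauses for a single field under sow=false.
     * An analyzed field tokenizes the input itself (modelled here as a
     * whitespace split); a non-analyzed field such as a string field should
     * receive the whole input as one term.
     */
    static List<String> clausesFor(String field, String userInput, boolean analyzedField) {
        List<String> clauses = new ArrayList<>();
        if (analyzedField) {
            for (String token : userInput.trim().split("\\s+")) {
                clauses.add(field + ":" + token);
            }
        } else {
            clauses.add(field + ":" + userInput);
        }
        return clauses;
    }

    public static void main(String[] args) {
        // Desired for a string field: a single clause with the whole input
        System.out.println(clausesFor("trait_ss", "multi term", false));
        // Shape of the broken query reported above: one clause per token
        System.out.println(clausesFor("trait_ss", "multi term", true));
    }
}
```

The first call models the single clause the test expects (trait_ss:multi term); the second models the split clauses that are currently produced for the string field.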