[
https://issues.apache.org/jira/browse/SOLR-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13945719#comment-13945719
]
Brian Carver commented on SOLR-2649:
------------------------------------
I've been following this issue for at least two years; see my comment from
February 2012, above. I can't tell whether the proposed fix actually fixes the
problem. Our goal ought to be a system that behaves deterministically, in a way
that can be explained to users, and that acts contrary to user expectations as
little as possible (especially silently). The failure to abide by this
principle is what made this issue so troubling to me: users could know that
whitespace would be interpreted as "AND", yet they would still get results that
discarded the effect that operator should have had.
Now, of course, users make mistakes. They submit ambiguous queries (or, in the
case of mm=100% for a disjunctive query, what I guess we could call a
self-defeating or self-contradictory query, if I understand mm correctly).
I still think that what is really needed is (1) a set of default rules for
interpreting ambiguous queries that always yields a deterministic result and
that can be explained to users, and (2) an error message whenever a user does
something that doesn't make sense under those default rules.
The ambiguous query discussed above was one where whitespace was set to "AND"
and a user entered:
(A or B or C) "D E"
Such a user must be assuming that whitespace within quotation marks is ignored,
i.e., that the quotation marks make "D E" a single term that must be matched
exactly, and that, given the default to conjunction for non-quoted whitespace,
her query will be parsed as:
(A or B or C) AND "D E"
that is, as a conjunction with two conjuncts, thus requiring that each conjunct
be satisfied to get a matching result.
My first question then is, what will happen to this query under the new patch?
Will it be interpreted as expected?
My second question is, why not adopt a set of default rules for ambiguous
queries? Arithmetic has a default order of operations: by convention we
interpret 3 + 5 x 4 as 3 + (5 x 4). Just as it didn't matter in arithmetic
which operators were favored, so long as everyone knew the convention, it
doesn't really matter which rules we adopt here, so long as we publicize them
so that users and maintainers know what to expect. I would propose the
following:
1. Whitespace within quotation marks is ignored (in that it is not turned into
   an operator); that is, "D E" is interpreted as a single term that must match
   exactly.
2. If a query lacks sufficient parentheses to create an unambiguous query, then
   the following rules will be applied:
   a. Insert parentheses around every occurrence of AND and its two conjuncts,
      starting with the rightmost AND.
   b. Insert parentheses in the same fashion for OR.
   c. Right parentheses are never inserted within another set of parentheses,
      i.e., no existing pairings are broken up.
3. If one's query is nonsensical, an error message will be displayed explaining
   the problem. For example, if one has set mm to 100%, requiring every term to
   match, yet one also issues a disjunctive query (A OR B) that would be
   satisfied if either term were to match, then one receives an error
   indicating that mm cannot be set to 100% while issuing a disjunctive query.
I think those rules would be sufficient to resolve all ambiguous queries and
the general idea that "If you leave out parentheses, then they'll be added to
the smallest available units starting from the right, and starting with
conjunction" is one that users could (somewhat) easily grasp.
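
To make rule 2 concrete, here is a rough Java sketch of the grouping procedure.
It is only an illustration under stated assumptions, not how Solr's parser
works: the class DefaultGrouping and its method names are hypothetical, only
uppercase AND and OR are treated as operators, an existing parenthesized group
is kept as an opaque unit (rule 2c), and malformed input such as a dangling
operator is not validated.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of proposed rules 1 and 2; not Solr code.
class DefaultGrouping {

    /** Splits a query into top-level units: quoted phrases, existing groups, bare words. */
    static List<String> tokenize(String query) {
        List<String> units = new ArrayList<>();
        int i = 0;
        while (i < query.length()) {
            char c = query.charAt(i);
            if (Character.isWhitespace(c)) { i++; continue; }
            int start = i;
            if (c == '"') {
                // Rule 1: a quoted phrase is one unit; its whitespace is not an operator.
                i = query.indexOf('"', i + 1);
                if (i < 0) throw new IllegalArgumentException("unterminated quote");
                i++;
            } else if (c == '(') {
                // Rule 2c: an existing parenthesized group is kept intact as one unit.
                int depth = 0;
                do {
                    if (i >= query.length()) throw new IllegalArgumentException("unbalanced parenthesis");
                    char d = query.charAt(i++);
                    if (d == '(') depth++;
                    else if (d == ')') depth--;
                } while (depth > 0);
            } else {
                while (i < query.length() && !Character.isWhitespace(query.charAt(i))
                        && query.charAt(i) != '(' && query.charAt(i) != '"') i++;
            }
            units.add(query.substring(start, i));
        }
        return units;
    }

    static boolean isOperator(String unit) {
        return unit.equals("AND") || unit.equals("OR");
    }

    /** Replaces bare whitespace between two operands with the configured default operator. */
    static List<String> withDefaultOperator(List<String> units, String defaultOp) {
        List<String> out = new ArrayList<>();
        for (String u : units) {
            if (!out.isEmpty() && !isOperator(u) && !isOperator(out.get(out.size() - 1))) {
                out.add(defaultOp);
            }
            out.add(u);
        }
        return out;
    }

    /** Wraps each occurrence of op and its two neighbors in parentheses, rightmost first. */
    static List<String> group(List<String> units, String op) {
        List<String> out = new ArrayList<>(units);
        for (int i = out.size() - 2; i >= 1; i--) {
            if (out.get(i).equals(op)) {
                String merged = "(" + out.get(i - 1) + " " + op + " " + out.get(i + 1) + ")";
                out.subList(i - 1, i + 2).clear();
                out.add(i - 1, merged);
            }
        }
        return out;
    }

    static String parenthesize(String query, String defaultOp) {
        List<String> units = withDefaultOperator(tokenize(query), defaultOp);
        units = group(units, "AND"); // rule 2a
        units = group(units, "OR");  // rule 2b
        return String.join(" ", units);
    }

    public static void main(String[] args) {
        // prints: ((A or B or C) AND "D E")
        System.out.println(parenthesize("(A or B or C) \"D E\"", "AND"));
        // prints: (A OR (B AND (C AND D)))
        System.out.println(parenthesize("A OR B AND C AND D", "AND"));
    }
}
```

With whitespace set to AND, the query discussed above comes out as a
conjunction with two conjuncts, which is what the user in that example
expected.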
But, as I said in 2012, my grasp on how Solr handles mm is tenuous at best, so
perhaps someone will explain that I'm misunderstanding something important.
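
For what it's worth, rule 3 could be sketched as a simple check that runs
before the query is parsed. This is a hypothetical illustration of the proposal
only: MmCheck and validate are made-up names, mm is simplified to a single
integer percentage, and nothing here reflects how Solr actually evaluates mm.

```java
import java.util.regex.Pattern;

// Hypothetical illustration of proposed rule 3; not Solr code.
class MmCheck {
    // Quoted phrases are removed first so that an "OR" inside quotes is not counted.
    private static final Pattern QUOTED = Pattern.compile("\"[^\"]*\"");
    // An uppercase OR standing alone between whitespace or parentheses.
    private static final Pattern OR_OPERATOR = Pattern.compile("(^|[\\s()])OR($|[\\s()])");

    /** Throws if mm is 100% and the query contains an explicit OR operator. */
    static void validate(String query, int mmPercent) {
        String unquoted = QUOTED.matcher(query).replaceAll(" ");
        if (mmPercent == 100 && OR_OPERATOR.matcher(unquoted).find()) {
            throw new IllegalArgumentException(
                "mm cannot be set to 100% while issuing a disjunctive query");
        }
    }
}
```

Under this sketch, validate("A OR B", 100) is rejected with the error message,
while validate("A OR B", 50) and validate("\"A OR B\" C", 100) pass.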
> MM ignored in edismax queries with operators
> --------------------------------------------
>
> Key: SOLR-2649
> URL: https://issues.apache.org/jira/browse/SOLR-2649
> Project: Solr
> Issue Type: Bug
> Components: query parsers
> Reporter: Magnus Bergmark
> Priority: Minor
> Fix For: 4.8, 5.0
>
> Attachments: SOLR-2649.diff, SOLR-2649.patch
>
>
> Hypothetical scenario:
> 1. User searches for "stocks oil gold" with MM set to "50%"
> 2. User adds "-stockings" to the query: "stocks oil gold -stockings"
> 3. User gets no hits since MM was ignored and all terms were AND-ed
> together
> The behavior seems to be intentional, although the reason why is never
> explained:
> // For correct lucene queries, turn off mm processing if there
> // were explicit operators (except for AND).
> boolean doMinMatched = (numOR + numNOT + numPluses + numMinuses) == 0;
> (lines 232-234 taken from
> tags/lucene_solr_3_3/solr/src/java/org/apache/solr/search/ExtendedDismaxQParserPlugin.java)
> This makes edismax unsuitable as a replacement for dismax; mm is one of the
> primary features of dismax.