[ 
https://issues.apache.org/jira/browse/SOLR-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13945719#comment-13945719
 ] 

Brian Carver commented on SOLR-2649:
------------------------------------

I've been following this for at least two years. See my comment from February 
2012, above. I can't tell whether the proposed fix actually fixes the problem. 
Our goal ought to be a system that behaves deterministically, in a way that can 
be explained to users, and that acts contrary to user expectations as little as 
possible (especially not silently). The failure to abide by this principle is 
what made this issue so troubling to me: users could know that whitespace would 
be interpreted as "AND", yet they would still get results that discarded the 
effect that operator should have had.

Now, of course, users make mistakes. They submit ambiguous queries (or in the 
case of mm=100% for a disjunctive query, I guess we could call that a 
self-defeating or self-contradictory query--if I understand mm correctly).

I still think that what is really needed is (1) a set of default rules for 
interpreting ambiguous queries that always yields a deterministic result and 
that can be explained to users, and (2) an error message whenever a user does 
something that doesn't make sense under those default rules.

The ambiguous query discussed above was one where whitespace was set to "AND" 
and a user entered:
(A or B or C) "D E"

Such a user must be assuming that whitespace within quotation marks is ignored, 
i.e., that the quotation marks make "D E" a single term that must be matched 
exactly, and that, given the default to conjunction for non-quoted whitespace, 
her query will be parsed as:
(A or B or C) AND "D E"

that is, as a conjunction with two conjuncts, thus requiring that each conjunct 
be satisfied to get a matching result.

My first question then is, what will happen to this query under the new patch? 
Will it be interpreted as expected?

My second question is, why not adopt a set of default rules for ambiguous 
queries? Arithmetic has a default order of operations, a convention under which 
3 + 5 x 4 is read as 3 + (5 x 4); we simply need a similar convention here. Just 
as it doesn't matter in arithmetic which operators are favored, so long as 
everyone knows the convention, it doesn't really matter which rules we adopt 
here, so long as we publicize them so users and maintainers know what to 
expect. I would propose the following:

1. Whitespace within quotation marks is ignored (in that it is not turned into 
an operator); that is, "D E" is interpreted as a single term that must match 
exactly.
2. If a query lacks sufficient parentheses to create an unambiguous query, then 
the following rules will be applied:
   a. Insert parentheses around every occurrence of AND and its two conjuncts, 
      starting with the rightmost AND.
   b. Insert parentheses in the same fashion for OR.
   c. Right parentheses are never inserted within another set of parentheses, 
      i.e., no existing pairings are broken up.
3. If one's query is nonsensical, an error message will be displayed explaining 
the problem. For example, if one has set mm to 100%, requiring every term to 
match, yet also issues a disjunctive query (A OR B) that would be satisfied if 
either term matched, then one receives an error indicating that mm cannot be 
set to 100% while issuing a disjunctive query.
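
To make rule 3 concrete, here is a minimal sketch of such a check in Java. It 
is an illustration only: the class MmConflictCheck and its methods are 
hypothetical names, not part of Solr, and the sketch assumes that a 
"disjunctive query" can be detected by looking for a literal uppercase OR token 
outside quotation marks. How mm actually interacts with operators inside 
edismax is not modeled here.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical pre-parse check for rule 3; not part of Solr. */
final class MmConflictCheck {
    // A token is either a quoted phrase or a run of characters that are
    // neither whitespace nor parentheses.
    private static final Pattern TOKEN = Pattern.compile("\"[^\"]*\"|[^\\s()]+");

    /** Returns true if an explicit OR appears outside quotation marks. */
    static boolean hasExplicitOr(String query) {
        Matcher m = TOKEN.matcher(query);
        while (m.find()) {
            if (m.group().equals("OR")) {
                return true;
            }
        }
        return false;
    }

    /**
     * Returns an error message if mm is 100% and the query is disjunctive,
     * or an empty Optional if this sketch finds no conflict.
     */
    static Optional<String> conflict(String query, int mmPercent) {
        if (mmPercent == 100 && hasExplicitOr(query)) {
            return Optional.of(
                "mm cannot be set to 100% while issuing a disjunctive query: " + query);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Prints the error message for the conflicting combination.
        System.out.println(conflict("A OR B", 100).orElse("no conflict"));
        // Prints "no conflict" because the OR sits inside a quoted phrase.
        System.out.println(conflict("\"A OR B\" C", 100).orElse("no conflict"));
    }
}
```

A caller would turn a non-empty result into whatever error response is shown 
to the user, rather than silently dropping either mm or the operator.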

I think those rules would be sufficient to resolve all ambiguous queries and 
the general idea that "If you leave out parentheses, then they'll be added to 
the smallest available units starting from the right, and starting with 
conjunction" is one that users could (somewhat) easily grasp.
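
Rules 1 and 2 can be sketched in a few dozen lines of Java. This is only an 
illustration of the proposed convention, not Solr's parser: the class 
DefaultGrouping and its methods are hypothetical names, whitespace between two 
operands is assumed to default to AND, "and"/"or" are assumed to be recognized 
as operators regardless of case (as in the example query above), and NOT, + and 
- are not handled at all.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration of proposed rules 1 and 2; not Solr's parser. */
final class DefaultGrouping {
    // Rule 1: a quoted phrase is one token. Otherwise a token is a single
    // parenthesis or a bare word.
    private static final Pattern TOKEN =
        Pattern.compile("\"[^\"]*\"|[()]|[^\\s()\"]+");

    /** Returns the query with the implied parentheses made explicit. */
    static String parenthesize(String query) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(query);
        while (m.find()) {
            tokens.add(m.group());
        }
        int[] pos = {0};
        String result = parseGroup(tokens, pos);
        if (pos[0] != tokens.size()) {
            throw new IllegalArgumentException("Unbalanced ')'");
        }
        return result;
    }

    // Parses up to the matching ')' or the end of input. Existing
    // parenthesized groups are parsed recursively and then treated as single
    // operands, so existing pairings are never broken up (rule 2c).
    private static String parseGroup(List<String> tokens, int[] pos) {
        List<String> items = new ArrayList<>(); // operand, operator, operand, ...
        while (pos[0] < tokens.size() && !tokens.get(pos[0]).equals(")")) {
            String t = tokens.get(pos[0]++);
            boolean isOp = t.equalsIgnoreCase("AND") || t.equalsIgnoreCase("OR");
            boolean expectOperand = items.size() % 2 == 0;
            if (isOp) {
                if (expectOperand) {
                    throw new IllegalArgumentException("Operator without left operand: " + t);
                }
                items.add(t.equalsIgnoreCase("AND") ? "AND" : "OR");
                continue;
            }
            if (!expectOperand) {
                items.add("AND"); // whitespace between operands: assumed default
            }
            if (t.equals("(")) {
                String inner = parseGroup(tokens, pos);
                if (pos[0] >= tokens.size()) {
                    throw new IllegalArgumentException("Missing ')'");
                }
                pos[0]++; // consume the matching ')'
                items.add(inner);
            } else {
                items.add(t);
            }
        }
        if (items.size() % 2 == 0) {
            throw new IllegalArgumentException("Empty group or dangling operator");
        }
        group(items, "AND"); // rule 2a
        group(items, "OR");  // rule 2b
        return items.get(0);
    }

    // Wraps each occurrence of op together with its two operands, starting
    // with the rightmost occurrence.
    private static void group(List<String> items, String op) {
        for (int i = items.size() - 2; i >= 1; i -= 2) {
            if (items.get(i).equals(op)) {
                String wrapped =
                    "(" + items.get(i - 1) + " " + op + " " + items.get(i + 1) + ")";
                items.subList(i - 1, i + 2).clear();
                items.add(i - 1, wrapped);
            }
        }
    }

    public static void main(String[] args) {
        // Prints: ((A OR (B OR C)) AND "D E")
        System.out.println(parenthesize("(A or B or C) \"D E\""));
    }
}
```

Under these assumptions, A OR B AND C comes out as (A OR (B AND C)), and the 
example query from earlier in this comment comes out as a conjunction whose 
second conjunct is the exact phrase "D E".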

But, as I said in 2012, my grasp on how Solr handles mm is tenuous at best, so 
perhaps someone will explain that I'm misunderstanding something important.

> MM ignored in edismax queries with operators
> --------------------------------------------
>
>                 Key: SOLR-2649
>                 URL: https://issues.apache.org/jira/browse/SOLR-2649
>             Project: Solr
>          Issue Type: Bug
>          Components: query parsers
>            Reporter: Magnus Bergmark
>            Priority: Minor
>             Fix For: 4.8, 5.0
>
>         Attachments: SOLR-2649.diff, SOLR-2649.patch
>
>
> Hypothetical scenario:
>   1. User searches for "stocks oil gold" with MM set to "50%"
>   2. User adds "-stockings" to the query: "stocks oil gold -stockings"
>   3. User gets no hits since MM was ignored and all terms were AND-ed 
> together
> The behavior seems to be intentional, although the reason why is never 
> explained:
>   // For correct lucene queries, turn off mm processing if there
>   // were explicit operators (except for AND).
>   boolean doMinMatched = (numOR + numNOT + numPluses + numMinuses) == 0; 
> (lines 232-234 taken from 
> tags/lucene_solr_3_3/solr/src/java/org/apache/solr/search/ExtendedDismaxQParserPlugin.java)
> This makes edismax unsuitable as a replacement for dismax; mm is one of the 
> primary features of dismax.


