[ 
https://issues.apache.org/jira/browse/LUCENE-8477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16802680#comment-16802680
 ] 

ASF subversion and git services commented on LUCENE-8477:
---------------------------------------------------------

Commit f1782d0dd1195c823c79aab87529ebd7e8217b95 in lucene-solr's branch 
refs/heads/master from Alan Woodward
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=f1782d0 ]

LUCENE-8477: Automatically rewrite disjunctions when internal gaps matter (#620)

We have a number of IntervalsSource implementations where automatic
minimization of disjunctions can lead to surprising results:

* PHRASE queries can miss matches because a longer matching sub-source is
  minimized away, leaving a gap
* MAXGAPS queries can miss matches for the same reason
* CONTAINING, NOT_CONTAINING, CONTAINED_BY and NOT_CONTAINED_BY queries
  can miss matches if the 'big' interval gets minimized
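
To make the PHRASE case concrete, here is a minimal sketch in plain Java. It is
not Lucene code: the Interval class and the minimize() and phraseMatches()
helpers are hypothetical names invented for this illustration, and intervals are
modelled as simple closed position ranges. For the document "a b c", the
disjunction OR(PHRASE("b", "c"), "c") has candidate intervals [1,2] and [2,2];
minimization keeps only [2,2], which is no longer adjacent to "a" at [0,0].

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MinimizationSketch {

    /** Closed interval of token positions [start, end] (hypothetical helper type). */
    static final class Interval {
        final int start;
        final int end;

        Interval(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Keeps only the candidates that do not contain a different candidate. */
    static List<Interval> minimize(List<Interval> candidates) {
        List<Interval> minimal = new ArrayList<>();
        for (Interval a : candidates) {
            boolean containsOther = false;
            for (Interval b : candidates) {
                boolean contains = a.start <= b.start && b.end <= a.end;
                boolean same = a.start == b.start && a.end == b.end;
                if (contains && !same) {
                    containsOther = true;
                }
            }
            if (!containsOther) {
                minimal.add(a);
            }
        }
        return minimal;
    }

    /** True if some left interval ends exactly one position before a right interval starts. */
    static boolean phraseMatches(List<Interval> left, List<Interval> right) {
        for (Interval l : left) {
            for (Interval r : right) {
                if (l.end + 1 == r.start) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Document "a b c": a at position 0, PHRASE(b, c) at [1,2], c at [2,2].
        List<Interval> a = Arrays.asList(new Interval(0, 0));
        List<Interval> disjunction = Arrays.asList(new Interval(1, 2), new Interval(2, 2));

        System.out.println("minimized: " + phraseMatches(a, minimize(disjunction)));
        System.out.println("all candidates: " + phraseMatches(a, disjunction));
    }
}
```

In this toy model the minimized disjunction fails to match while the full set of
candidates matches, which is the gap described in the first bullet.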

The proper way to deal with this is to rewrite the queries by pulling
disjunctions to the top of the query tree, so that
PHRASE("a", OR(PHRASE("b", "c"), "c")) is rewritten to
OR(PHRASE("a", "b", "c"), PHRASE("a", "c")). To be able to do this generally,
we need to add a new pullUpDisjunctions() method to IntervalsSource that
performs this rewriting for each source that it would apply to.
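
The rewriting can be sketched on a toy query tree as follows. This is an
illustration only, not the committed pullUpDisjunctions() implementation: the
Node class and the expand() and pullUp() helpers are hypothetical, and the
sketch handles only terms, PHRASE and OR. A PHRASE is expanded as the cross
product of its children's alternatives, and an OR simply concatenates them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class PullUpSketch {

    /** Toy query node: a term, or a PHRASE / OR over child nodes (hypothetical type). */
    static final class Node {
        final String kind;         // "TERM", "PHRASE" or "OR"
        final String term;         // only set for TERM nodes
        final List<Node> children; // empty for TERM nodes

        private Node(String kind, String term, List<Node> children) {
            this.kind = kind;
            this.term = term;
            this.children = children;
        }

        static Node term(String text) {
            return new Node("TERM", text, Collections.<Node>emptyList());
        }

        static Node phrase(Node... children) {
            return new Node("PHRASE", null, Arrays.asList(children));
        }

        static Node or(Node... children) {
            return new Node("OR", null, Arrays.asList(children));
        }
    }

    /** Expands a node into its disjunction-free alternatives, each a flat term sequence. */
    static List<List<String>> expand(Node node) {
        List<List<String>> out = new ArrayList<>();
        if (node.kind.equals("TERM")) {
            out.add(Collections.singletonList(node.term));
        } else if (node.kind.equals("OR")) {
            for (Node child : node.children) {
                out.addAll(expand(child));
            }
        } else {
            // PHRASE: combine every prefix built so far with every alternative of the next child.
            out.add(new ArrayList<String>());
            for (Node child : node.children) {
                List<List<String>> next = new ArrayList<>();
                for (List<String> prefix : out) {
                    for (List<String> alternative : expand(child)) {
                        List<String> joined = new ArrayList<>(prefix);
                        joined.addAll(alternative);
                        next.add(joined);
                    }
                }
                out = next;
            }
        }
        return out;
    }

    /** Renders the expanded form with any disjunction at the top of the tree. */
    static String pullUp(Node node) {
        List<String> phrases = new ArrayList<>();
        for (List<String> sequence : expand(node)) {
            List<String> quoted = new ArrayList<>();
            for (String term : sequence) {
                quoted.add("\"" + term + "\"");
            }
            phrases.add("PHRASE(" + String.join(", ", quoted) + ")");
        }
        return phrases.size() == 1 ? phrases.get(0) : "OR(" + String.join(", ", phrases) + ")";
    }

    public static void main(String[] args) {
        Node query = Node.phrase(
            Node.term("a"),
            Node.or(Node.phrase(Node.term("b"), Node.term("c")), Node.term("c")));
        // prints OR(PHRASE("a", "b", "c"), PHRASE("a", "c"))
        System.out.println(pullUp(query));
    }
}
```

The sketch flattens nested phrases into a single term sequence, which is enough
to reproduce the example above but ignores every other kind of source.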

Because these rewritten queries will in general be less efficient due to the
duplication of effort (e.g. the rewritten PHRASE query above pulls 5 term
iterators rather than 4 in the original), we also add an option to
Intervals.or() that will prevent this rewriting, so that consumers can choose
speed over accuracy if it suits their use case.
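
The 5-versus-4 figure can be checked with a small cost model. This is a
back-of-the-envelope sketch, not Lucene code: the Cost class and its term(),
or() and phrase() helpers are hypothetical, and the model assumes one term
iterator per term occurrence in the query tree.

```java
public class RewriteCostSketch {

    /** Term-iterator counts before and after pulling disjunctions up (hypothetical type). */
    static final class Cost {
        final long leaves;        // term iterators in the original tree
        final long alternatives;  // disjunction-free clauses after the rewrite
        final long expandedTerms; // term iterators summed over all those clauses

        Cost(long leaves, long alternatives, long expandedTerms) {
            this.leaves = leaves;
            this.alternatives = alternatives;
            this.expandedTerms = expandedTerms;
        }
    }

    static Cost term() {
        return new Cost(1, 1, 1);
    }

    /** An OR concatenates the alternatives of its sub-sources. */
    static Cost or(Cost... subs) {
        long leaves = 0;
        long alternatives = 0;
        long expandedTerms = 0;
        for (Cost sub : subs) {
            leaves += sub.leaves;
            alternatives += sub.alternatives;
            expandedTerms += sub.expandedTerms;
        }
        return new Cost(leaves, alternatives, expandedTerms);
    }

    /** A PHRASE takes the cross product of the alternatives of its sub-sources. */
    static Cost phrase(Cost... subs) {
        long leaves = 0;
        long alternatives = 1;
        long expandedTerms = 0;
        for (Cost sub : subs) {
            leaves += sub.leaves;
            // Existing clauses are repeated once per new alternative, and the
            // new alternatives are repeated once per existing clause.
            expandedTerms = expandedTerms * sub.alternatives + sub.expandedTerms * alternatives;
            alternatives *= sub.alternatives;
        }
        return new Cost(leaves, alternatives, expandedTerms);
    }

    public static void main(String[] args) {
        // PHRASE("a", OR(PHRASE("b", "c"), "c"))
        Cost cost = phrase(term(), or(phrase(term(), term()), term()));
        System.out.println(cost.leaves + " term iterators before, " + cost.expandedTerms + " after");
    }
}
```

Under this model the count grows multiplicatively with each inner disjunction,
which is why an opt-out can be attractive for large queries.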

> Improve handling of inner disjunctions in intervals
> ---------------------------------------------------
>
>                 Key: LUCENE-8477
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8477
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Alan Woodward
>            Priority: Major
>         Attachments: LUCENE-8477.patch, LUCENE-8477.patch, LUCENE-8477.patch, 
> LUCENE-8477.patch
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> The current implementation of the disjunction interval produced by 
> {{Intervals.or}} is a direct implementation of the OR operator from the Vigna 
> paper.  This produces minimal intervals, meaning that (a) is preferred over 
> (a b), and (b) also over (a b).  This has advantages when it comes to 
> counting intervals for scoring, but also has drawbacks when it comes to 
> matching.  For example, a phrase query for ((a OR (a b)) BLOCK (c)) will not 
> match the document (a b c), because (a) will be preferred over (a b), and (a 
> c) does not match.
> This ticket is to discuss the best way of dealing with disjunctions.



