[ https://issues.apache.org/jira/browse/LUCENE-8477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16793794#comment-16793794 ]
Alan Woodward commented on LUCENE-8477:
---------------------------------------
Here is a proposal to fix this, using the new QueryVisitor API to work out
whether a disjunction has sub-clauses that share a common first term. Given an
interval {{BLOCK(a,or(BLOCK(b,c),b),d)}}, we can ensure that all matches are
collected by moving the final clause {{d}} inside the disjunction, yielding
{{BLOCK(a,or(BLOCK(b,c,d),BLOCK(b,d)))}}. Checking for common prefixes means
that intervals of the form {{BLOCK(a,or(BLOCK(b,c),d),e)}}, whose sub-clauses
start with different terms, don't need to be rewritten. Leaving them alone is
more efficient at query time, because the positions of the final term only
have to be iterated once.
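
The rewrite rule can be sketched on a toy expression tree. This is only an
illustration in Python, not the attached patch and not the Lucene API: the
names Block, Or, first_terms, shares_first_term and rewrite are hypothetical,
and the real implementation would gather first terms through QueryVisitor.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Block:
    clauses: Tuple["Node", ...]


@dataclass(frozen=True)
class Or:
    clauses: Tuple["Node", ...]


Node = Union[str, Block, Or]  # a bare str stands for a single term


def first_terms(node):
    """Terms that can start a match of this (toy) interval expression."""
    if isinstance(node, str):
        return {node}
    if isinstance(node, Block):
        return first_terms(node.clauses[0])
    return set().union(*(first_terms(c) for c in node.clauses))


def shares_first_term(disjunction):
    """True if two sub-clauses of the disjunction can start with the same term."""
    seen = set()
    for clause in disjunction.clauses:
        terms = first_terms(clause)
        if seen & terms:
            return True
        seen |= terms
    return False


def block(*clauses):
    """Build a Block, flattening directly nested Blocks."""
    flat = []
    for c in clauses:
        flat.extend(c.clauses if isinstance(c, Block) else [c])
    return Block(tuple(flat))


def rewrite(node):
    """Push trailing clauses into the first disjunction with a shared first term."""
    if not isinstance(node, Block):
        return node
    for i, clause in enumerate(node.clauses):
        has_tail = i + 1 < len(node.clauses)
        if isinstance(clause, Or) and has_tail and shares_first_term(clause):
            tail = node.clauses[i + 1:]
            pushed = Or(tuple(rewrite(block(sub, *tail)) for sub in clause.clauses))
            return block(*node.clauses[:i], pushed)
    return node  # no shared prefix: leave the expression untouched


def show(node):
    if isinstance(node, str):
        return node
    name = "BLOCK" if isinstance(node, Block) else "or"
    return name + "(" + ",".join(show(c) for c in node.clauses) + ")"


print(show(rewrite(block("a", Or((block("b", "c"), "b")), "d"))))
print(show(rewrite(block("a", Or((block("b", "c"), "d")), "e"))))
```

The first expression has two sub-clauses starting with b, so d is distributed
into the disjunction. The second has no shared first term and is returned
unchanged.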
> Improve handling of inner disjunctions in intervals
> ---------------------------------------------------
>
> Key: LUCENE-8477
> URL: https://issues.apache.org/jira/browse/LUCENE-8477
> Project: Lucene - Core
> Issue Type: New Feature
> Reporter: Alan Woodward
> Priority: Major
> Attachments: LUCENE-8477.patch
>
>
> The current implementation of the disjunction interval produced by
> {{Intervals.or}} is a direct implementation of the OR operator from the Vigna
> paper. This produces minimal intervals, meaning that (a) is preferred over
> (a b), and (b) also over (a b). This has advantages when it comes to
> counting intervals for scoring, but also has drawbacks when it comes to
> matching. For example, a phrase query for ((a OR (a b)) BLOCK (c)) will not
> match the document (a b c), because (a) will be preferred over (a b), and (a
> c) does not match.
> This ticket is to discuss the best way of dealing with disjunctions.
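
The matching problem in the quoted description can be reproduced with a toy
model of minimal intervals. This is a sketch in Python under simplifying
assumptions, not Lucene code: intervals are inclusive (start, end) position
pairs, and term, block and or_minimal are hypothetical helpers.

```python
def term(doc, t):
    """Intervals covering each occurrence of term t in the document."""
    return [(i, i) for i, tok in enumerate(doc) if tok == t]


def block(left, right):
    """Intervals where a right interval starts directly after a left interval."""
    return sorted(
        {(l_start, r_end)
         for l_start, l_end in left
         for r_start, r_end in right
         if r_start == l_end + 1}
    )


def or_minimal(*sources):
    """Union of the sources, keeping only intervals that contain no other interval."""
    union = {iv for src in sources for iv in src}

    def contains_other(iv):
        return any(o != iv and iv[0] <= o[0] and o[1] <= iv[1] for o in union)

    return sorted(iv for iv in union if not contains_other(iv))


doc = ["a", "b", "c"]
a, b, c = (term(doc, t) for t in "abc")

# (a OR (a b)): the interval for "a b" contains the one for "a" and is dropped.
a_or_ab = or_minimal(a, block(a, b))

# ((a OR (a b)) BLOCK c): only (a) survives, and c does not directly follow it.
direct = block(a_or_ab, c)

# Distributing c into the disjunction keeps the longer alternative alive.
distributed = or_minimal(block(a, c), block(block(a, b), c))

print(a_or_ab, direct, distributed)
```

In this model the direct form yields no intervals for the document (a b c),
while the distributed form yields the single interval spanning all three
positions.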