[
https://issues.apache.org/jira/browse/LUCENE-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15405903#comment-15405903
]
Christoph Goller commented on LUCENE-7398:
------------------------------------------
After thoroughly looking into SpanQueries my conclusion is, that we have a
fundamental problem in the implementation of SpanNearQuery. The problem is not
new, it probably existed already in the first version of SpanQueries which as
far as I know were implemented by Doug Cutting himself. I remember some
attempts to describe in which cases SpanQueries work correctly and in which
they do not (discussions about overlapping), but those explanations and
definitions were never completely convincing for me.
My best guess: NearSpansOrdered and NearSpansUnordered currently are only
correct if for each clause of the SpanQuery we can guarantee, that all its
matches have the same length. In this case it is clear that (for the ordered
case) if a match is too long (sloppy) we can skip to the first clause and call
nextPosition. No alternative matches of intermediate clauses could improve the
overall match. It we have clauses with varying match length (SpanOr or SpanNear
with sloppyness) we would have to backtrack to intermediate clauses and check
whether there are e.g. longer matches that could reduce the overall match
length. Pauls last test case shows that even a match of the second clause that
advances its position can reduce the overall lenght if it is longer himnself. A
match of an intermediate clause at an advanced position could be considerably
shorter than its first match requiring a reset of the spans of following
clauses. To my opinion this bug can only be fixed by implementing a
backtracking search on the subspans that also requires a limited possibilitxy
to reposion Spans to previous positions.
By the way, shrinkToAfterShortestMatch() in NearSpansOrdered of Lucene 4_10_4
provided a kind of backtracking which was the reason why my queries worked in
elasticsearch 1.7.x. However, I think the implementation also did not solve all
cases:
{code}
/** The subSpans are ordered in the same doc, so there is a possible match.
* Compute the slop while making the match as short as possible by advancing
* all subSpans except the last one in reverse order.
*/
private boolean shrinkToAfterShortestMatch() throws IOException {
matchStart = subSpans[subSpans.length - 1].start();
matchEnd = subSpans[subSpans.length - 1].end();
Set<byte[]> possibleMatchPayloads = new HashSet<>();
if (subSpans[subSpans.length - 1].isPayloadAvailable()) {
possibleMatchPayloads.addAll(subSpans[subSpans.length - 1].getPayload());
}
Collection<byte[]> possiblePayload = null;
int matchSlop = 0;
int lastStart = matchStart;
int lastEnd = matchEnd;
for (int i = subSpans.length - 2; i >= 0; i--) {
Spans prevSpans = subSpans[i];
if (collectPayloads && prevSpans.isPayloadAvailable()) {
Collection<byte[]> payload = prevSpans.getPayload();
possiblePayload = new ArrayList<>(payload.size());
possiblePayload.addAll(payload);
}
int prevStart = prevSpans.start();
int prevEnd = prevSpans.end();
while (true) { // Advance prevSpans until after (lastStart, lastEnd)
if (! prevSpans.next()) {
inSameDoc = false;
more = false;
break; // Check remaining subSpans for final match.
} else if (matchDoc != prevSpans.doc()) {
inSameDoc = false; // The last subSpans is not advanced here.
break; // Check remaining subSpans for last match in this document.
} else {
int ppStart = prevSpans.start();
int ppEnd = prevSpans.end(); // Cannot avoid invoking .end()
if (! docSpansOrderedNonOverlap(ppStart, ppEnd, lastStart, lastEnd)) {
break; // Check remaining subSpans.
} else { // prevSpans still before (lastStart, lastEnd)
prevStart = ppStart;
prevEnd = ppEnd;
if (collectPayloads && prevSpans.isPayloadAvailable()) {
Collection<byte[]> payload = prevSpans.getPayload();
possiblePayload = new ArrayList<>(payload.size());
possiblePayload.addAll(payload);
}
}
}
}
if (collectPayloads && possiblePayload != null) {
possibleMatchPayloads.addAll(possiblePayload);
}
assert prevStart <= matchStart;
if (matchStart > prevEnd) { // Only non overlapping spans add to slop.
matchSlop += (matchStart - prevEnd);
}
/* Do not break on (matchSlop > allowedSlop) here to make sure
* that subSpans[0] is advanced after the match, if any.
*/
matchStart = prevStart;
lastStart = prevStart;
lastEnd = prevEnd;
}
boolean match = matchSlop <= allowedSlop;
if(collectPayloads && match && possibleMatchPayloads.size() > 0) {
matchPayload.addAll(possibleMatchPayloads);
}
return match; // ordered and allowed slop
}
{code}
Unfortunately the provided patch does not solve the problem. It only reorders
span-matches of a SpanOrQuery in a special case, since it cannot control the
order of span-matches of its subspans. I consider the patch as potentially
dangerous since SpanOrQuery with the patch provides an ordering of span-matches
that differs form the general contract that holds for all spans.
> Nested Span Queries are buggy
> -----------------------------
>
> Key: LUCENE-7398
> URL: https://issues.apache.org/jira/browse/LUCENE-7398
> Project: Lucene - Core
> Issue Type: Bug
> Components: core/search
> Affects Versions: 5.5, 6.x
> Reporter: Christoph Goller
> Assignee: Alan Woodward
> Priority: Critical
> Attachments: LUCENE-7398.patch, LUCENE-7398.patch,
> TestSpanCollection.java
>
>
> Example for a nested SpanQuery that is not working:
> Document: Human Genome Organization , HUGO , is trying to coordinate gene
> mapping research worldwide.
> Query: spanNear([body:coordinate, spanOr([spanNear([body:gene, body:mapping],
> 0, true), body:gene]), body:research], 0, true)
> The query should match "coordinate gene mapping research" as well as
> "coordinate gene research". It does not match "coordinate gene mapping
> research" with Lucene 5.5 or 6.1, it did however match with Lucene 4.10.4. It
> probably stopped working with the changes on SpanQueries in 5.3. I will
> attach a unit test that shows the problem.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]