[jira] [Commented] (LUCENE-7398) Nested Span Queries are buggy

Christoph Goller (JIRA) Wed, 03 Aug 2016 06:18:38 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15405903#comment-15405903
 ]


Christoph Goller commented on LUCENE-7398:
------------------------------------------

After thoroughly looking into SpanQueries my conclusion is, that we have a 
fundamental problem in the implementation of SpanNearQuery. The problem is not 
new, it probably existed already in the first version of SpanQueries which as 
far as I know were implemented by Doug Cutting himself. I remember some 
attempts to describe in which cases SpanQueries work correctly and in which 
they do not (discussions about overlapping), but those explanations and 
definitions were never completely convincing for me. 

My best guess: NearSpansOrdered and NearSpansUnordered currently are only 
correct if for each clause of the SpanQuery we can guarantee, that all its 
matches have the same length. In this case it is clear that (for the ordered 
case) if a match is too long (sloppy) we can skip to the first clause and call 
nextPosition. No alternative matches of intermediate clauses could improve the 
overall match. It we have clauses with varying match length (SpanOr or SpanNear 
with sloppyness) we would have to backtrack to intermediate clauses and check 
whether there are e.g. longer matches that could reduce the overall match 
length. Pauls last test case shows that even a match of the second clause that 
advances its position can reduce the overall lenght if it is longer himnself. A 
match of an intermediate clause at an advanced position could be considerably 
shorter than its first match requiring a reset of the spans of following 
clauses. To my opinion this bug can only be fixed by implementing a 
backtracking search on the subspans that also requires a limited possibilitxy 
to reposion Spans to previous positions.

By the way, shrinkToAfterShortestMatch() in NearSpansOrdered of Lucene 4_10_4 
provided a kind of backtracking which was the reason why my queries worked in 
elasticsearch 1.7.x. However, I think the implementation also did not solve all 
cases:

{code}
  /** The subSpans are ordered in the same doc, so there is a possible match.
   * Compute the slop while making the match as short as possible by advancing
   * all subSpans except the last one in reverse order.
   */
  private boolean shrinkToAfterShortestMatch() throws IOException {
    matchStart = subSpans[subSpans.length - 1].start();
    matchEnd = subSpans[subSpans.length - 1].end();
    Set<byte[]> possibleMatchPayloads = new HashSet<>();
    if (subSpans[subSpans.length - 1].isPayloadAvailable()) {
      possibleMatchPayloads.addAll(subSpans[subSpans.length - 1].getPayload());
    }

    Collection<byte[]> possiblePayload = null;
    
    int matchSlop = 0;
    int lastStart = matchStart;
    int lastEnd = matchEnd;
    for (int i = subSpans.length - 2; i >= 0; i--) {
      Spans prevSpans = subSpans[i];
      if (collectPayloads && prevSpans.isPayloadAvailable()) {
        Collection<byte[]> payload = prevSpans.getPayload();
        possiblePayload = new ArrayList<>(payload.size());
        possiblePayload.addAll(payload);
      }
      
      int prevStart = prevSpans.start();
      int prevEnd = prevSpans.end();
      while (true) { // Advance prevSpans until after (lastStart, lastEnd)
        if (! prevSpans.next()) {
          inSameDoc = false;
          more = false;
          break; // Check remaining subSpans for final match.
        } else if (matchDoc != prevSpans.doc()) {
          inSameDoc = false; // The last subSpans is not advanced here.
          break; // Check remaining subSpans for last match in this document.
        } else {
          int ppStart = prevSpans.start();
          int ppEnd = prevSpans.end(); // Cannot avoid invoking .end()
          if (! docSpansOrderedNonOverlap(ppStart, ppEnd, lastStart, lastEnd)) {
            break; // Check remaining subSpans.
          } else { // prevSpans still before (lastStart, lastEnd)
            prevStart = ppStart;
            prevEnd = ppEnd;
            if (collectPayloads && prevSpans.isPayloadAvailable()) {
              Collection<byte[]> payload = prevSpans.getPayload();
              possiblePayload = new ArrayList<>(payload.size());
              possiblePayload.addAll(payload);
            }
          }
        }
      }

      if (collectPayloads && possiblePayload != null) {
        possibleMatchPayloads.addAll(possiblePayload);
      }
      
      assert prevStart <= matchStart;
      if (matchStart > prevEnd) { // Only non overlapping spans add to slop.
        matchSlop += (matchStart - prevEnd);
      }

      /* Do not break on (matchSlop > allowedSlop) here to make sure
       * that subSpans[0] is advanced after the match, if any.
       */
      matchStart = prevStart;
      lastStart = prevStart;
      lastEnd = prevEnd;
    }
    
    boolean match = matchSlop <= allowedSlop;
    
    if(collectPayloads && match && possibleMatchPayloads.size() > 0) {
      matchPayload.addAll(possibleMatchPayloads);
    }

    return match; // ordered and allowed slop
  }
  
{code}

Unfortunately the provided patch does not solve the problem. It only reorders 
span-matches of a SpanOrQuery in a special case, since it cannot control the 
order of span-matches of its subspans. I consider the patch as potentially 
dangerous since SpanOrQuery with the patch provides an ordering of span-matches 
that differs form the general contract that holds for all spans.



> Nested Span Queries are buggy
> -----------------------------
>
>                 Key: LUCENE-7398
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7398
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/search
>    Affects Versions: 5.5, 6.x
>            Reporter: Christoph Goller
>            Assignee: Alan Woodward
>            Priority: Critical
>         Attachments: LUCENE-7398.patch, LUCENE-7398.patch, 
> TestSpanCollection.java
>
>
> Example for a nested SpanQuery that is not working:
> Document: Human Genome Organization , HUGO , is trying to coordinate gene 
> mapping research worldwide.
> Query: spanNear([body:coordinate, spanOr([spanNear([body:gene, body:mapping], 
> 0, true), body:gene]), body:research], 0, true)
> The query should match "coordinate gene mapping research" as well as 
> "coordinate gene research". It does not match  "coordinate gene mapping 
> research" with Lucene 5.5 or 6.1, it did however match with Lucene 4.10.4. It 
> probably stopped working with the changes on SpanQueries in 5.3. I will 
> attach a unit test that shows the problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-7398) Nested Span Queries are buggy

Reply via email to