TermVectorOffsetStrategy producing Passages with matches out of order? (causing IndexOutOfBoundsException)

Chris Hostetter Thu, 29 Jun 2023 14:55:52 -0700

I've got a user getting java.lang.IndexOutOfBoundsException from the UnifiedHighlighter in Solr 9.1.0 w/Lucene 9.3.0

(And FWIW, this same data, w/same configs, in 8.11.1, purportedly didn't have this problem)



I don't really understand the highlighter code very well, but AFAICT:

- DefaultPassageFormatter seems to assume that the "matches"
  inside a single Passage will be "in order" (by offset)
  - it accounts for the possibility that they overlap
  - but not that matchEnds[i+1] < matchStarts[i]
- but in some cases (that I don't understand)
  - TermVectorOffsetStrategy can produce Passages that are "reversed"
  - apparently based on the iteration order from
    OfMatchesIteratorWithSubs ?

Which means DefaultPassageFormatter can trigger IOOBE in StringBuilder..

java.lang.IndexOutOfBoundsException: start 8, end 7, length 16
  at java.lang.AbstractStringBuilder.checkRange(Unknown Source) ~[?:?]
  at java.lang.AbstractStringBuilder.append(Unknown Source) ~[?:?]
  at java.lang.StringBuilder.append(Unknown Source) ~[?:?]
  at org.apache.lucene.search.uhighlight.DefaultPassageFormatter.append(DefaultPassageFormatter.java:133) ~[?:?]
  at org.apache.lucene.search.uhighlight.DefaultPassageFormatter.format(DefaultPassageFormatter.java:84) ~[?:?]
  at org.apache.lucene.search.uhighlight.DefaultPassageFormatter.format(DefaultPassageFormatter.java:25) ~[?:?]
  at org.apache.lucene.search.uhighlight.FieldHighlighter.highlightFieldForDoc(FieldHighlighter.java:94) ~[?:?]
  at org.apache.lucene.search.uhighlight.UnifiedHighlighter.highlightFieldsAsObjects(UnifiedHighlighter.java:954) ~[?:?]
  at org.apache.lucene.search.uhighlight.UnifiedHighlighter.highlightFields(UnifiedHighlighter.java:824) ~[?:?]
  at org.apache.solr.highlight.UnifiedSolrHighlighter.doHighlighting(UnifiedSolrHighlighter.java:165) ~[?:?]

...as it tries to append a subsequence based on the start+end of "overlapping" matches that don't actually overlap -- the end of the "i+1" match is just strictly less than the "start" of the "i" match because of how the Passage was built.
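
To make the failure mode concrete, here is a minimal standalone sketch of the look-ahead merge described above. It is a simplified model written for illustration, not the actual Lucene source: the class name PassageFormatSketch, the plain int[] arrays standing in for the Passage's match starts and ends, and the hard-coded <em> tags are all assumptions. It only relies on the JDK's StringBuilder.append(CharSequence, int, int), which rejects a range whose end is less than its start.

```java
// Simplified, hypothetical model of the formatter's merge loop (NOT Lucene code).
class PassageFormatSketch {

  // starts/ends stand in for the per-match offsets stored in a Passage.
  static String format(String content, int[] starts, int[] ends) {
    StringBuilder sb = new StringBuilder();
    int pos = 0;
    for (int i = 0; i < starts.length; i++) {
      int start = starts[i];
      sb.append(content, pos, start); // text between the previous match and this one
      int end = ends[i];
      // Look ahead: treat the next match as "overlapping" if it starts before 'end'.
      // This assumes matches are sorted by start offset; if they are not, 'end'
      // can be replaced by a value smaller than 'start'.
      while (i + 1 < starts.length && starts[i + 1] < end) {
        end = ends[++i];
      }
      sb.append("<em>");
      sb.append(content, start, end); // throws if end < start
      sb.append("</em>");
      pos = end;
    }
    sb.append(content, pos, content.length());
    return sb.toString();
  }

  public static void main(String[] args) {
    // Matches in offset order: machine[0-7], learning[8-16]
    System.out.println(format("Machine Learning", new int[] {0, 8}, new int[] {7, 16}));
    // Matches in "reversed" order: learning[8-16], machine[0-7]
    try {
      format("Machine Learning", new int[] {8, 0}, new int[] {16, 7});
    } catch (IndexOutOfBoundsException e) {
      System.out.println("IOOBE: " + e.getMessage());
    }
  }
}
```

With the matches in offset order the sketch highlights both terms; with the reversed order it ends up calling append(content, 8, 7), the same start/end pair as in the exception above.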

I'm still trying to wrap my head around all the moving pieces to try and reproduce this in a small scale lucene test, but in the meantime I patched some of the 9.3.0 highlighter code (patch below sig) to include some debugging output to kind of show what's happening here...


http://localhost:8983/solr/workplace/select?fl=Expertise,id&defType=lucene&df=Expertise&q=machine+learning&hl=true&rows=1&q.op=OR&echoParams=all

nocommit: highlightOffsetsEnums -> OfMatchesIteratorWithSubs(term:learning,[8-16])
nocommit: Passage2030658055.addMatch(8,16,[6c 65 61 72 6e 69 6e 67],1)
nocommit: highlightOffsetsEnums -> OfMatchesIteratorWithSubs(term:machine,[0-7])
nocommit: Passage2030658055.addMatch(0,7,[6d 61 63 68 69 6e 65],1)
nocommit: format([[Passage[0-16]{learning[8-16],machine[0-7]}score=2.7656934]],Machine Learning) <-- class org.apache.lucene.search.uhighlight.TermVectorOffsetStrategy
nocommit: append(,Machine Learning,0,8)
nocommit: append(Machine <em>,Machine Learning,8,7)
2023-06-29 21:11:15.711 ERROR (qtp1528769018-17) [ x:workplace] o.a.s.h.RequestHandlerBase java.lang.IndexOutOfBoundsException: start 8, end 7, length 16 => java.lang.IndexOutOfBoundsException: start 8, end 7, length 16
        at java.base/java.lang.AbstractStringBuilder.checkRange(AbstractStringBuilder.java:1716)
java.lang.IndexOutOfBoundsException: start 8, end 7, length 16
        at java.lang.AbstractStringBuilder.checkRange(AbstractStringBuilder.java:1716) ~[?:?]
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:631) ~[?:?]
        at java.lang.StringBuilder.append(StringBuilder.java:217) ~[?:?]
        at org.apache.lucene.search.uhighlight.DefaultPassageFormatter.append(DefaultPassageFormatter.java:134) ~[?:?]

..note how the OfMatchesIteratorWithSubs (OffsetEnum) enumerates over the two terms in this order...


        term:learning,[8-16]
        term:machine,[0-7]

...and that order is preserved in the final Passage -- leading DefaultPassageFormatter.format() to decide that the two matches in this Passage overlap (because the start of match#1 (machine[0-7]) is less than the end of match#0 (learning[8-16])) ... but they don't overlap, one is strictly before the other, so it winds up passing StringBuilder.append an end < start.
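
As a way to check the ordering assumption in isolation, the following sketch tests whether a set of matches is sorted by start offset and, if not, reorders the start and end arrays together. It illustrates one possible direction only and is not a confirmed fix; it does not settle whether the formatter or the OffsetEnum ordering is at fault. The class MatchOrderSketch and its methods are hypothetical names and are not part of Lucene.

```java
import java.util.Arrays;

// Hypothetical helper (NOT part of Lucene): detect and repair out-of-order matches.
class MatchOrderSketch {

  // True if every match starts at or after the previous match's start.
  static boolean isSortedByStart(int[] starts) {
    for (int i = 1; i < starts.length; i++) {
      if (starts[i] < starts[i - 1]) {
        return false;
      }
    }
    return true;
  }

  // Returns {sortedStarts, sortedEnds}, keeping each start paired with its end.
  static int[][] sortByStart(int[] starts, int[] ends) {
    Integer[] order = new Integer[starts.length];
    for (int i = 0; i < order.length; i++) {
      order[i] = i;
    }
    // Sort the indices by start offset (stable, so equal starts keep their order).
    Arrays.sort(order, (a, b) -> Integer.compare(starts[a], starts[b]));
    int[] sortedStarts = new int[starts.length];
    int[] sortedEnds = new int[ends.length];
    for (int k = 0; k < order.length; k++) {
      sortedStarts[k] = starts[order[k]];
      sortedEnds[k] = ends[order[k]];
    }
    return new int[][] {sortedStarts, sortedEnds};
  }

  public static void main(String[] args) {
    int[] starts = {8, 0}; // learning, machine (the order seen in the debug output)
    int[] ends = {16, 7};
    System.out.println("sorted before: " + isSortedByStart(starts));
    int[][] fixed = sortByStart(starts, ends);
    System.out.println(Arrays.toString(fixed[0]) + " " + Arrays.toString(fixed[1]));
  }
}
```

For the example above, the reordered arrays put machine[0-7] ahead of learning[8-16], which is the order the look-ahead in the formatter appears to expect.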



 * Has anyone seen any failures like this ?
 * Is this a bug in DefaultPassageFormatter's assumptions,
   or in the ordering produced by the OffsetEnum ?
 * Does anyone have a theory where/how the problem might have changed
   between 8.11 and 9.3 ?


-Hoss
http://www.lucidworks.com/



diff --git a/lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/DefaultPassageFormatter.java b/lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/DefaultPassageFormatter.java
index 345e2b61316..c82362b5eac 100644
--- a/lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/DefaultPassageFormatter.java
+++ b/lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/DefaultPassageFormatter.java
@@ -102,6 +102,7 @@ public class DefaultPassageFormatter extends PassageFormatter {
    * @param end index of the character following the last character in content
    */
   protected void append(StringBuilder dest, String content, int start, int end) {
+    System.err.println("nocommit: append("+dest+","+content+","+start+","+end+")");
     if (escape) {
       // note: these are the rules from owasp.org
       for (int i = start; i < end; i++) {
diff --git a/lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/FieldHighlighter.java b/lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/FieldHighlighter.java
index aacb9089e91..eba4e2a6082 100644
--- a/lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/FieldHighlighter.java
+++ b/lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/FieldHighlighter.java
@@ -91,6 +91,7 @@ public class FieldHighlighter {
       }

       if (passages.length > 0) {
+        System.err.println("nocommit: format(["+java.util.Arrays.toString(passages)+"],"+content+") <-- "+ fieldOffsetStrategy.getClass());
         return passageFormatter.format(passages, content);
       } else {
         return null;
@@ -152,6 +153,8 @@ public class FieldHighlighter {
     int lastPassageEnd = 0;

     do {
+      System.err.println("nocommit: highlightOffsetsEnums -> " + off.toString());
+
       int start = off.startOffset();
       if (start == -1) {
         throw new IllegalArgumentException(
diff --git a/lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/Passage.java b/lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/Passage.java
index 6fa281bb16c..09cd89dc14b 100644
--- a/lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/Passage.java
+++ b/lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/Passage.java
@@ -41,6 +41,8 @@ public class Passage {

   /** @lucene.internal */
   public void addMatch(int startOffset, int endOffset, BytesRef term, int termFreqInDoc) {
+    System.err.println("nocommit: Passage"+System.identityHashCode(this)+".addMatch("+startOffset+","+endOffset+","+term+","+termFreqInDoc+")");
+
     assert startOffset >= this.startOffset && startOffset <= this.endOffset;
     if (numMatches == matchStarts.length) {
       int newLength = ArrayUtil.oversize(numMatches + 1, RamUsageEstimator.NUM_BYTES_OBJECT_REF);

