I've got a user getting java.lang.IndexOutOfBoundsException from the
UnifiedHighlighter in Solr 9.1.0 w/Lucene 9.3.0
(And FWIW, this same data, w/same configs, in 8.11.1, purportedly didn't
have this problem)
I don't really understand the highlighter code very well, but AFAICT:

- DefaultPassageFormatter seems to assume that the "matches"
  inside a single Passage will be "in order" (by offset)
   - it accounts for the possibility that they overlap
   - but not that matchEnds[i+1] < matchStarts[i]
- but in some cases (that I don't understand)
   - TermVectorOffsetStrategy can produce Passages that are "reversed"
   - apparently based on the iteration order from
     OfMatchesIteratorWithSubs ?

Which means DefaultPassageFormatter can trigger IOOBE in StringBuilder...
java.lang.IndexOutOfBoundsException: start 8, end 7, length 16
    at java.lang.AbstractStringBuilder.checkRange(Unknown Source) ~[?:?]
    at java.lang.AbstractStringBuilder.append(Unknown Source) ~[?:?]
    at java.lang.StringBuilder.append(Unknown Source) ~[?:?]
    at org.apache.lucene.search.uhighlight.DefaultPassageFormatter.append(DefaultPassageFormatter.java:133) ~[?:?]
    at org.apache.lucene.search.uhighlight.DefaultPassageFormatter.format(DefaultPassageFormatter.java:84) ~[?:?]
    at org.apache.lucene.search.uhighlight.DefaultPassageFormatter.format(DefaultPassageFormatter.java:25) ~[?:?]
    at org.apache.lucene.search.uhighlight.FieldHighlighter.highlightFieldForDoc(FieldHighlighter.java:94) ~[?:?]
    at org.apache.lucene.search.uhighlight.UnifiedHighlighter.highlightFieldsAsObjects(UnifiedHighlighter.java:954) ~[?:?]
    at org.apache.lucene.search.uhighlight.UnifiedHighlighter.highlightFields(UnifiedHighlighter.java:824) ~[?:?]
    at org.apache.solr.highlight.UnifiedSolrHighlighter.doHighlighting(UnifiedSolrHighlighter.java:165) ~[?:?]
...as it tries to append a subsequence based on the start+end of
"overlapping" matches that don't actually overlap -- the end of the
"i+1" match is just strictly less than the "start" of the "i"
match because of how the Passage was built
I'm still trying to wrap my head around all the moving pieces to
try and reproduce this in a small scale lucene test, but in the meantime I
patched some of the 9.3.0 highlighter code (patch below sig) to include
some debugging output to kind of show what's happening here...
http://localhost:8983/solr/workplace/select?fl=Expertise,id&defType=lucene&df=Expertise&q=machine+learning&hl=true&rows=1&q.op=OR&echoParams=all
nocommit: highlightOffsetsEnums -> OfMatchesIteratorWithSubs(term:learning,[8-16])
nocommit: Passage2030658055.addMatch(8,16,[6c 65 61 72 6e 69 6e 67],1)
nocommit: highlightOffsetsEnums -> OfMatchesIteratorWithSubs(term:machine,[0-7])
nocommit: Passage2030658055.addMatch(0,7,[6d 61 63 68 69 6e 65],1)
nocommit: format([[Passage[0-16]{learning[8-16],machine[0-7]}score=2.7656934]],Machine Learning) <-- class org.apache.lucene.search.uhighlight.TermVectorOffsetStrategy
nocommit: append(,Machine Learning,0,8)
nocommit: append(Machine <em>,Machine Learning,8,7)
2023-06-29 21:11:15.711 ERROR (qtp1528769018-17) [ x:workplace] o.a.s.h.RequestHandlerBase java.lang.IndexOutOfBoundsException: start 8, end 7, length 16 => java.lang.IndexOutOfBoundsException: start 8, end 7, length 16
    at java.base/java.lang.AbstractStringBuilder.checkRange(AbstractStringBuilder.java:1716)
java.lang.IndexOutOfBoundsException: start 8, end 7, length 16
    at java.lang.AbstractStringBuilder.checkRange(AbstractStringBuilder.java:1716) ~[?:?]
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:631) ~[?:?]
    at java.lang.StringBuilder.append(StringBuilder.java:217) ~[?:?]
    at org.apache.lucene.search.uhighlight.DefaultPassageFormatter.append(DefaultPassageFormatter.java:134) ~[?:?]
...note how the OfMatchesIteratorWithSubs (OffsetEnum) enumerates over the
two terms in this order...

   term:learning,[8-16]
   term:machine,[0-7]

...and that order is preserved in the final Passage -- leading
DefaultPassageFormatter.format() to decide that the two matches in this
Passage overlap (because the start of match#1 (machine[0-7]) is less than
the end of match#0 (learning[8-16])) ... but they don't overlap, one is
strictly before the other, so it winds up passing StringBuilder.append an
end < start.
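
To make the failure mode concrete, here is a minimal standalone sketch of
the look-ahead merge described above. It is NOT the Lucene source: the class
and method names (ReversedMatchSketch, format, formatSorted) are made up for
illustration, plain int arrays stand in for a Passage's match starts/ends,
and the "<em>" tags are hard-coded. The sort in formatSorted is only a
hypothetical workaround to show that ordering is the issue, not a proposed
fix.

```java
import java.util.Arrays;
import java.util.Comparator;

public class ReversedMatchSketch {

  /** Simplified stand-in for the merge loop in a passage formatter (assumption, not Lucene code). */
  static String format(String content, int[] starts, int[] ends) {
    StringBuilder sb = new StringBuilder();
    int pos = 0;
    for (int i = 0; i < starts.length; i++) {
      int start = starts[i];
      sb.append(content, pos, start); // text between the previous match and this one
      int end = ends[i];
      // Look ahead: any later match starting before 'end' is treated as overlapping,
      // which is only valid if the matches are ordered by start offset.
      while (i + 1 < starts.length && starts[i + 1] < end) {
        end = ends[++i];
      }
      sb.append("<em>");
      sb.append(content, start, end); // throws IndexOutOfBoundsException when end < start
      sb.append("</em>");
      pos = end;
    }
    sb.append(content, pos, Math.max(pos, content.length()));
    return sb.toString();
  }

  /** Hypothetical workaround: order the matches by start offset before formatting. */
  static String formatSorted(String content, int[] starts, int[] ends) {
    Integer[] order = new Integer[starts.length];
    for (int i = 0; i < order.length; i++) {
      order[i] = i;
    }
    Arrays.sort(order, Comparator.comparingInt(i -> starts[i]));
    int[] s = new int[starts.length];
    int[] e = new int[ends.length];
    for (int i = 0; i < order.length; i++) {
      s[i] = starts[order[i]];
      e[i] = ends[order[i]];
    }
    return format(content, s, e);
  }

  public static void main(String[] args) {
    String content = "Machine Learning";
    int[] starts = {8, 0};  // learning, machine (the "reversed" order from the debug output)
    int[] ends = {16, 7};
    System.out.println(formatSorted(content, starts, ends));
    try {
      format(content, starts, ends);
    } catch (IndexOutOfBoundsException ex) {
      System.out.println("IOOBE: " + ex.getMessage());
    }
  }
}
```

With the matches in the reversed order, the look-ahead replaces end=16 with
end=7 while start is still 8, which is the append(...,8,7) call seen in the
debug output; with the matches ordered by start offset the same loop
produces "<em>Machine</em> <em>Learning</em>".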
* Has anyone seen any failures like this?
* Is this a bug in DefaultPassageFormatter's assumptions,
  or in the ordering produced by the OffsetEnum?
* Does anyone have a theory where/how the behavior might have changed
  between 8.11 and 9.3?
-Hoss
http://www.lucidworks.com/
diff --git a/lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/DefaultPassageFormatter.java b/lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/DefaultPassageFormatter.java
index 345e2b61316..c82362b5eac 100644
--- a/lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/DefaultPassageFormatter.java
+++ b/lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/DefaultPassageFormatter.java
@@ -102,6 +102,7 @@ public class DefaultPassageFormatter extends PassageFormatter {
* @param end index of the character following the last character in content
*/
protected void append(StringBuilder dest, String content, int start, int end) {
+ System.err.println("nocommit: append("+dest+","+content+","+start+","+end+")");
if (escape) {
// note: these are the rules from owasp.org
for (int i = start; i < end; i++) {
diff --git a/lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/FieldHighlighter.java b/lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/FieldHighlighter.java
index aacb9089e91..eba4e2a6082 100644
--- a/lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/FieldHighlighter.java
+++ b/lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/FieldHighlighter.java
@@ -91,6 +91,7 @@ public class FieldHighlighter {
}
if (passages.length > 0) {
+ System.err.println("nocommit: format(["+java.util.Arrays.toString(passages)+"],"+content+") <-- "+ fieldOffsetStrategy.getClass());
return passageFormatter.format(passages, content);
} else {
return null;
@@ -152,6 +153,8 @@ public class FieldHighlighter {
int lastPassageEnd = 0;
do {
+ System.err.println("nocommit: highlightOffsetsEnums -> " + off.toString());
+
int start = off.startOffset();
if (start == -1) {
throw new IllegalArgumentException(
diff --git a/lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/Passage.java b/lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/Passage.java
index 6fa281bb16c..09cd89dc14b 100644
--- a/lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/Passage.java
+++ b/lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/Passage.java
@@ -41,6 +41,8 @@ public class Passage {
/** @lucene.internal */
public void addMatch(int startOffset, int endOffset, BytesRef term, int termFreqInDoc) {
+ System.err.println("nocommit: Passage"+System.identityHashCode(this)+".addMatch("+startOffset+","+endOffset+","+term+","+termFreqInDoc+")");
+
assert startOffset >= this.startOffset && startOffset <= this.endOffset;
if (numMatches == matchStarts.length) {
int newLength = ArrayUtil.oversize(numMatches + 1, RamUsageEstimator.NUM_BYTES_OBJECT_REF);