[
https://issues.apache.org/jira/browse/LUCENE-8614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16780319#comment-16780319
]
Adrien Grand commented on LUCENE-8614:
--------------------------------------
+1 to check for overflows and raise a better error
Maybe we can write a test that uses reasonable amounts of memory by using a
dummy allocator that always returns the same byte[].
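To illustrate the idea, here is a minimal, self-contained sketch of such a test: a dummy allocator that hands out the same byte[] for every block, so the pool's {{byteOffset}} bookkeeping can be driven past 2 GiB of "allocated" bytes without the heap actually growing. The class and method names here are illustrative stand-ins, not Lucene's real Allocator API.

```java
// Sketch of the dummy-allocator test idea (illustrative names, not Lucene's API):
// every getByteBlock() call returns the same shared block, so advancing the pool
// costs no extra memory, yet the int byteOffset still wraps past Integer.MAX_VALUE.
public class DummyAllocatorDemo {
    static final int BYTE_BLOCK_SIZE = 1 << 15; // 32768, as in ByteBlockPool

    // Stand-in for an Allocator that always returns the same block.
    static final byte[] SHARED = new byte[BYTE_BLOCK_SIZE];

    static byte[] getByteBlock() {
        return SHARED; // no new allocation per "buffer"
    }

    // Mimics the byteOffset bookkeeping at the end of nextBuffer().
    public static int byteOffsetAfter(long nextBufferCalls) {
        int byteOffset = -BYTE_BLOCK_SIZE; // ByteBlockPool's initial value
        for (long i = 0; i < nextBufferCalls; i++) {
            getByteBlock();                // "allocate" the next buffer
            byteOffset += BYTE_BLOCK_SIZE; // silently overflows near 2^31
        }
        return byteOffset;
    }

    public static void main(String[] args) {
        // After 65537 calls the running total reaches 2^31 and the int goes negative.
        System.out.println(byteOffsetAfter(65537));
    }
}
```

Run as-is this prints a negative offset, which is exactly the value that later poisons the {{bytesStart}} computation.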
> ArrayIndexOutOfBoundsException in ByteBlockPool
> -----------------------------------------------
>
> Key: LUCENE-8614
> URL: https://issues.apache.org/jira/browse/LUCENE-8614
> Project: Lucene - Core
> Issue Type: Bug
> Components: core/index
> Affects Versions: 7.5
> Reporter: Igor Motov
> Priority: Major
> Attachments: LUCENE-8614.patch
>
>
> A field with a very large number of small tokens can cause an
> ArrayIndexOutOfBoundsException due to an arithmetic overflow in
> {{ByteBlockPool}}.
> The issue was originally reported in
> [https://github.com/elastic/elasticsearch/issues/23670] where due to the
> indexing settings the geo_shape generated a very large number of tokens and
> caused the indexing operation to fail with the following exception:
> {noformat}
> Caused by: java.lang.ArrayIndexOutOfBoundsException: -65531
> at org.apache.lucene.util.ByteBlockPool.setBytesRef(ByteBlockPool.java:308) ~[lucene-core-6.4.0.jar:6.4.0 bbe4b08cc1fb673d0c3eb4b8455f23ddc1364124 - jim - 2017-01-17 15:57:29]
> at org.apache.lucene.util.BytesRefHash.equals(BytesRefHash.java:183) ~[lucene-core-6.4.0.jar:6.4.0 bbe4b08cc1fb673d0c3eb4b8455f23ddc1364124 - jim - 2017-01-17 15:57:29]
> at org.apache.lucene.util.BytesRefHash.findHash(BytesRefHash.java:337) ~[lucene-core-6.4.0.jar:6.4.0 bbe4b08cc1fb673d0c3eb4b8455f23ddc1364124 - jim - 2017-01-17 15:57:29]
> at org.apache.lucene.util.BytesRefHash.add(BytesRefHash.java:255) ~[lucene-core-6.4.0.jar:6.4.0 bbe4b08cc1fb673d0c3eb4b8455f23ddc1364124 - jim - 2017-01-17 15:57:29]
> at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:149) ~[lucene-core-6.4.0.jar:6.4.0 bbe4b08cc1fb673d0c3eb4b8455f23ddc1364124 - jim - 2017-01-17 15:57:29]
> at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:766) ~[lucene-core-6.4.0.jar:6.4.0 bbe4b08cc1fb673d0c3eb4b8455f23ddc1364124 - jim - 2017-01-17 15:57:29]
> at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:417) ~[lucene-core-6.4.0.jar:6.4.0 bbe4b08cc1fb673d0c3eb4b8455f23ddc1364124 - jim - 2017-01-17 15:57:29]
> at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:373) ~[lucene-core-6.4.0.jar:6.4.0 bbe4b08cc1fb673d0c3eb4b8455f23ddc1364124 - jim - 2017-01-17 15:57:29]
> at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:231) ~[lucene-core-6.4.0.jar:6.4.0 bbe4b08cc1fb673d0c3eb4b8455f23ddc1364124 - jim - 2017-01-17 15:57:29]
> at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:478) ~[lucene-core-6.4.0.jar:6.4.0 bbe4b08cc1fb673d0c3eb4b8455f23ddc1364124 - jim - 2017-01-17 15:57:29]
> at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1575) ~[lucene-core-6.4.0.jar:6.4.0 bbe4b08cc1fb673d0c3eb4b8455f23ddc1364124 - jim - 2017-01-17 15:57:29]
> at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1320) ~[lucene-core-6.4.0.jar:6.4.0 bbe4b08cc1fb673d0c3eb4b8455f23ddc1364124 - jim - 2017-01-17 15:57:29]
> {noformat}
> I was able to reproduce the issue and reduce the reproducing test somewhat
> (see the enclosed patch), but unfortunately it still requires 12G of heap to
> run.
> The issue seems to be caused by arithmetic overflow in the {{byteOffset}}
> calculation when {{ByteBlockPool}} advances to the next buffer on the last
> line of the
> [nextBuffer()|https://github.com/apache/lucene-solr/blob/e386ec973b8a4ec2de2bfc43f51df511a365d60f/lucene/core/src/java/org/apache/lucene/util/ByteBlockPool.java#L207]
> method, but it doesn't manifest itself until much later, when this offset is
> used to calculate the
> [bytesStart|https://github.com/apache/lucene-solr/blob/e386ec973b8a4ec2de2bfc43f51df511a365d60f/lucene/core/src/java/org/apache/lucene/util/BytesRefHash.java#L277]
> in {{BytesRefHash}}, which in turn causes an ArrayIndexOutOfBoundsException
> back in the {{ByteBlockPool}}
> [setBytesRef()|https://github.com/apache/lucene-solr/blob/e386ec973b8a4ec2de2bfc43f51df511a365d60f/lucene/core/src/java/org/apache/lucene/util/ByteBlockPool.java#L308]
> method where it is used to find the term's buffer.
> I realize that it's unreasonable to expect Lucene to index such fields, but I
> wonder if an overflow check should be added to {{ByteBlockPool.nextBuffer}}
> in order to handle such a condition more gracefully.
>
>
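One possible shape for the overflow check the reporter suggests (a hedged sketch only, not the attached patch or the committed fix) is to make the offset advance fail fast with a clear exception instead of wrapping:

```java
// Hypothetical guard for the `byteOffset += BYTE_BLOCK_SIZE` step in
// ByteBlockPool.nextBuffer(): throw a descriptive exception at the point of
// overflow rather than letting a negative offset surface much later as an
// ArrayIndexOutOfBoundsException in setBytesRef().
public class OverflowGuardDemo {
    static final int BYTE_BLOCK_SIZE = 1 << 15; // 32768, as in ByteBlockPool

    public static int advance(int byteOffset) {
        try {
            // Math.addExact throws ArithmeticException on int overflow.
            return Math.addExact(byteOffset, BYTE_BLOCK_SIZE);
        } catch (ArithmeticException e) {
            throw new IllegalStateException(
                "ByteBlockPool is full: byte offset would exceed Integer.MAX_VALUE", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(advance(0)); // normal case: offset grows by one block
    }
}
```

The exception type and message are assumptions; the point is only that the failure becomes local and self-explanatory instead of a distant AIOOBE.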
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)