[ 
https://issues.apache.org/jira/browse/LUCENE-8614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16780319#comment-16780319
 ] 

Adrien Grand commented on LUCENE-8614:
--------------------------------------

+1 to check for overflows and raise a better error

Maybe we can write a test that uses reasonable amounts of memory by using a 
dummy allocator that always returns the same byte[].

> ArrayIndexOutOfBoundsException in ByteBlockPool
> -----------------------------------------------
>
>                 Key: LUCENE-8614
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8614
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/index
>    Affects Versions: 7.5
>            Reporter: Igor Motov
>            Priority: Major
>         Attachments: LUCENE-8614.patch
>
>
> A field with a very large number of small tokens can cause 
> ArrayIndexOutOfBoundsException in ByteBlockPool due to an arithmetic overflow 
> in ByteBlockPool.
> The issue was originally reported in 
> [https://github.com/elastic/elasticsearch/issues/23670] where due to the 
> indexing settings the geo_shape generated a very large number of tokens and 
> caused the indexing operation to fail with the following exception: 
> {noformat}
> Caused by: java.lang.ArrayIndexOutOfBoundsException: -65531
>       at 
> org.apache.lucene.util.ByteBlockPool.setBytesRef(ByteBlockPool.java:308) 
> ~[lucene-core-6.4.0.jar:6.4.0 bbe4b08cc1fb673d0c3eb4b8455f23ddc1364124 - jim 
> - 2017-01-17 15:57:29]
>       at org.apache.lucene.util.BytesRefHash.equals(BytesRefHash.java:183) 
> ~[lucene-core-6.4.0.jar:6.4.0 bbe4b08cc1fb673d0c3eb4b8455f23ddc1364124 - jim 
> - 2017-01-17 15:57:29]
>       at org.apache.lucene.util.BytesRefHash.findHash(BytesRefHash.java:337) 
> ~[lucene-core-6.4.0.jar:6.4.0 bbe4b08cc1fb673d0c3eb4b8455f23ddc1364124 - jim 
> - 2017-01-17 15:57:29]
>       at org.apache.lucene.util.BytesRefHash.add(BytesRefHash.java:255) 
> ~[lucene-core-6.4.0.jar:6.4.0 bbe4b08cc1fb673d0c3eb4b8455f23ddc1364124 - jim 
> - 2017-01-17 15:57:29]
>       at 
> org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:149) 
> ~[lucene-core-6.4.0.jar:6.4.0 bbe4b08cc1fb673d0c3eb4b8455f23ddc1364124 - jim 
> - 2017-01-17 15:57:29]
>       at 
> org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:766)
>  ~[lucene-core-6.4.0.jar:6.4.0 bbe4b08cc1fb673d0c3eb4b8455f23ddc1364124 - jim 
> - 2017-01-17 15:57:29]
>       at 
> org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:417)
>  ~[lucene-core-6.4.0.jar:6.4.0 bbe4b08cc1fb673d0c3eb4b8455f23ddc1364124 - jim 
> - 2017-01-17 15:57:29]
>       at 
> org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:373)
>  ~[lucene-core-6.4.0.jar:6.4.0 bbe4b08cc1fb673d0c3eb4b8455f23ddc1364124 - jim 
> - 2017-01-17 15:57:29]
>       at 
> org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:231)
>  ~[lucene-core-6.4.0.jar:6.4.0 bbe4b08cc1fb673d0c3eb4b8455f23ddc1364124 - jim 
> - 2017-01-17 15:57:29]
>       at 
> org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:478)
>  ~[lucene-core-6.4.0.jar:6.4.0 bbe4b08cc1fb673d0c3eb4b8455f23ddc1364124 - jim 
> - 2017-01-17 15:57:29]
>       at 
> org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1575) 
> ~[lucene-core-6.4.0.jar:6.4.0 bbe4b08cc1fb673d0c3eb4b8455f23ddc1364124 - jim 
> - 2017-01-17 15:57:29]
>       at 
> org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1320) 
> ~[lucene-core-6.4.0.jar:6.4.0 bbe4b08cc1fb673d0c3eb4b8455f23ddc1364124 - jim 
> - 2017-01-17 15:57:29]
> {noformat}
> I was able to reproduce the issue and somewhat reduce the test that 
> reproduces it (see enclosed patch) but unfortunately it still requires 12G of 
> heap to run.
> The issue seems to be caused by arithmetic overflow in the {{byteOffset}} 
> calculation when {{BytesBlockPool}} advances to the next buffer on the last 
> line of the 
> [nextBuffer()|https://github.com/apache/lucene-solr/blob/e386ec973b8a4ec2de2bfc43f51df511a365d60f/lucene/core/src/java/org/apache/lucene/util/ByteBlockPool.java#L207]
>  method, but it doesn't manifest itself until much later when this offset is 
> used to calculate the 
> [bytesStart|https://github.com/apache/lucene-solr/blob/e386ec973b8a4ec2de2bfc43f51df511a365d60f/lucene/core/src/java/org/apache/lucene/util/BytesRefHash.java#L277]
>  in {{BytesRefHash}}, which in turn causes AIOB back in the {{ByteBlockPool}} 
> [setBytesRef()|https://github.com/apache/lucene-solr/blob/e386ec973b8a4ec2de2bfc43f51df511a365d60f/lucene/core/src/java/org/apache/lucene/util/ByteBlockPool.java#L308]
>  method where it is used to find the term's buffer.
> I realize that it's unreasonable to expect lucene to index such fields, but I 
> wonder if an overflow check should be added to {{BytesBlockPool.nextBuffer}} 
> in order to handle such condition more gracefully. 
>   
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to