Luís Filipe Nassif created LUCENE-10681:
-------------------------------------------

             Summary: ArrayIndexOutOfBoundsException while indexing large binary file
                 Key: LUCENE-10681
                 URL: https://issues.apache.org/jira/browse/LUCENE-10681
             Project: Lucene - Core
          Issue Type: Bug
          Components: core/index
    Affects Versions: 9.2
         Environment: Linux Ubuntu (will check the user version), java x64 version 11.0.16.1
            Reporter: Luís Filipe Nassif


Hello,

I looked for a similar issue but didn't find one, so I'm creating this; sorry if it was reported before. We recently upgraded from Lucene 5.5.5 to 9.2.0, and a user reported the error below while indexing a huge binary file in a parent-children schema, where strings extracted from the huge binary file (using the strings command) are indexed as thousands of ~10 MB child documents of the parent metadata document:

Caused by: java.lang.ArrayIndexOutOfBoundsException: Index -65536 out of bounds for length 71428
    at org.apache.lucene.index.TermsHashPerField.writeByte(TermsHashPerField.java:219) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
    at org.apache.lucene.index.TermsHashPerField.writeVInt(TermsHashPerField.java:241) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
    at org.apache.lucene.index.FreqProxTermsWriterPerField.writeProx(FreqProxTermsWriterPerField.java:86) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
    at org.apache.lucene.index.FreqProxTermsWriterPerField.newTerm(FreqProxTermsWriterPerField.java:127) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
    at org.apache.lucene.index.TermsHashPerField.initStreamSlices(TermsHashPerField.java:175) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
    at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:198) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
    at org.apache.lucene.index.IndexingChain$PerField.invert(IndexingChain.java:1224) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
    at org.apache.lucene.index.IndexingChain.processField(IndexingChain.java:729) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
    at org.apache.lucene.index.IndexingChain.processDocument(IndexingChain.java:620) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
    at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:241) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
    at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:432) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
    at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1532) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
    at org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1503) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
    at iped.engine.task.index.IndexTask.process(IndexTask.java:148) ~[iped-engine-4.0.2.jar:?]
    at iped.engine.task.AbstractTask.processMonitorTimeout(AbstractTask.java:250) ~[iped-engine-4.0.2.jar:?]

 

This looks like an integer overflow to me, but I'm not sure. It didn't happen with the previous lucene-5.5.5, and indexing files like this is pretty common for us. With lucene-5.5.5, however, we used to break that huge file up manually before indexing, calling the IndexWriter.addDocument(Document) method once per ~10 MB chunk; with lucene-9.2.0 we now pass all the chunks to a single IndexWriter.addDocuments(Iterable) call, as sketched below. Any thoughts?
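For reference, here is a minimal sketch of the two indexing patterns described above. Field names, the chunk source, and the parent id are made up for illustration; this is not IPED's actual code:

    import java.io.IOException;
    import java.nio.file.Path;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;

    public class ChunkedIndexingSketch {

        // Old approach (lucene-5.5.5): each ~10 MB chunk of extracted strings
        // was indexed with its own addDocument(Document) call.
        static void indexChunksIndividually(IndexWriter writer, String parentId,
                                            List<String> chunks) throws IOException {
            for (String chunk : chunks) {
                Document child = new Document();
                child.add(new StringField("parentId", parentId, Field.Store.YES)); // hypothetical field
                child.add(new TextField("content", chunk, Field.Store.NO));
                writer.addDocument(child); // one call per chunk
            }
        }

        // New approach (lucene-9.2.0): the parent metadata document and all
        // child chunk documents are passed to a single addDocuments(Iterable)
        // call, so they are indexed together as one block.
        static void indexAsBlock(IndexWriter writer, Document parentMetadata,
                                 List<String> chunks) throws IOException {
            List<Document> block = new ArrayList<>();
            for (String chunk : chunks) {
                Document child = new Document();
                child.add(new TextField("content", chunk, Field.Store.NO));
                block.add(child);
            }
            block.add(parentMetadata); // parent document is conventionally last in the block
            writer.addDocuments(block);
        }

        public static void main(String[] args) throws IOException {
            try (FSDirectory dir = FSDirectory.open(Path.of("index"));
                 IndexWriter writer = new IndexWriter(dir,
                         new IndexWriterConfig(new StandardAnalyzer()))) {
                // In the real case the chunks come from running `strings` on the binary file.
                List<String> chunks = List.of("chunk one ...", "chunk two ...");

                indexChunksIndividually(writer, "evidence-123", chunks);

                Document parent = new Document();
                parent.add(new StringField("id", "evidence-123", Field.Store.YES));
                indexAsBlock(writer, parent, chunks);
            }
        }
    }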


