[
https://issues.apache.org/jira/browse/LUCENE-10681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Luís Filipe Nassif updated LUCENE-10681:
----------------------------------------
Environment: Ubuntu 20.04 (LTS), java x64 version 11.0.16.1 (was: Linux
Ubuntu (will check the user version), java x64 version 11.0.16.1)
> ArrayIndexOutOfBoundsException while indexing large binary file
> ---------------------------------------------------------------
>
> Key: LUCENE-10681
> URL: https://issues.apache.org/jira/browse/LUCENE-10681
> Project: Lucene - Core
> Issue Type: Bug
> Components: core/index
> Affects Versions: 9.2
> Environment: Ubuntu 20.04 (LTS), java x64 version 11.0.16.1
> Reporter: Luís Filipe Nassif
> Priority: Minor
>
> Hello,
> I looked for a similar issue but didn't find one, so I'm creating this one;
> sorry if it was reported before. We recently upgraded from Lucene 5.5.5 to
> 9.2.0, and a user reported the error below while indexing a huge binary file
> in a parent-children schema, where strings extracted from the binary file
> (using the strings command) are indexed as thousands of ~10MB child text
> documents of the parent metadata document:
>
> {noformat}
> Caused by: java.lang.ArrayIndexOutOfBoundsException: Index -65536 out of bounds for length 71428
> at org.apache.lucene.index.TermsHashPerField.writeByte(TermsHashPerField.java:219) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
> at org.apache.lucene.index.TermsHashPerField.writeVInt(TermsHashPerField.java:241) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
> at org.apache.lucene.index.FreqProxTermsWriterPerField.writeProx(FreqProxTermsWriterPerField.java:86) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
> at org.apache.lucene.index.FreqProxTermsWriterPerField.newTerm(FreqProxTermsWriterPerField.java:127) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
> at org.apache.lucene.index.TermsHashPerField.initStreamSlices(TermsHashPerField.java:175) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
> at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:198) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
> at org.apache.lucene.index.IndexingChain$PerField.invert(IndexingChain.java:1224) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
> at org.apache.lucene.index.IndexingChain.processField(IndexingChain.java:729) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
> at org.apache.lucene.index.IndexingChain.processDocument(IndexingChain.java:620) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
> at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:241) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
> at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:432) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
> at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1532) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
> at org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1503) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
> at iped.engine.task.index.IndexTask.process(IndexTask.java:148) ~[iped-engine-4.0.2.jar:?]
> at iped.engine.task.AbstractTask.processMonitorTimeout(AbstractTask.java:250) ~[iped-engine-4.0.2.jar:?]
> {noformat}
>
> This looks like an integer overflow to me, but I'm not sure. It didn't happen
> with Lucene 5.5.5, and indexing files like this is quite common for us.
> However, with Lucene 5.5.5 we split the huge file manually before indexing
> and called IndexWriter.addDocument(Document) once for each 10MB chunk,
> whereas with Lucene 9.2.0 we use IndexWriter.addDocuments(Iterable).
> Any thoughts?
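
To illustrate the integer-overflow suspicion raised in the report, here is a minimal, self-contained sketch of how a 32-bit offset can silently wrap to a negative value and later surface as a negative array index. It is not Lucene's code: the class name, the method names, and the block size are all hypothetical and chosen only for illustration. Whether this is what actually happens inside TermsHashPerField is an assumption that the stack trace alone does not confirm.

```java
public class OffsetOverflowSketch {
    // Hypothetical block size, picked only for this illustration.
    static final int BLOCK_SIZE = 1 << 15;

    // Advances a 32-bit offset with plain int arithmetic.
    // On overflow the result wraps around silently and becomes negative.
    static int nextOffsetUnchecked(int offset) {
        return offset + BLOCK_SIZE;
    }

    // Same step, but Math.addExact throws ArithmeticException on overflow
    // instead of returning a wrapped value.
    static int nextOffsetChecked(int offset) {
        return Math.addExact(offset, BLOCK_SIZE);
    }

    public static void main(String[] args) {
        // Largest block-aligned offset that still fits in an int.
        int lastValid = Integer.MAX_VALUE - BLOCK_SIZE + 1;

        int wrapped = nextOffsetUnchecked(lastValid);
        System.out.println("unchecked next offset: " + wrapped);

        try {
            nextOffsetChecked(lastValid);
        } catch (ArithmeticException e) {
            System.out.println("checked next offset failed: " + e.getMessage());
        }
    }
}
```

If something like this is the cause, a checked addition would turn the confusing ArrayIndexOutOfBoundsException into an explicit failure at the point where the offset stops fitting in an int.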