[
https://issues.apache.org/jira/browse/LUCENE-10681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Luís Filipe Nassif updated LUCENE-10681:
----------------------------------------
Description:
Hello,
I looked for a similar issue but didn't find one, so I'm creating this one; sorry
if it was reported before. We recently upgraded from Lucene 5.5.5 to 9.2.0, and a
user reported the error below while indexing a huge binary file in a
parent-children schema, where strings extracted from the huge binary file (using
the {{strings}} command) are indexed as thousands of ~10MB children text docs of
the parent metadata document:
{noformat}
Caused by: java.lang.ArrayIndexOutOfBoundsException: Index -65536 out of bounds for length 71428
	at org.apache.lucene.index.TermsHashPerField.writeByte(TermsHashPerField.java:219) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
	at org.apache.lucene.index.TermsHashPerField.writeVInt(TermsHashPerField.java:241) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
	at org.apache.lucene.index.FreqProxTermsWriterPerField.writeProx(FreqProxTermsWriterPerField.java:86) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
	at org.apache.lucene.index.FreqProxTermsWriterPerField.newTerm(FreqProxTermsWriterPerField.java:127) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
	at org.apache.lucene.index.TermsHashPerField.initStreamSlices(TermsHashPerField.java:175) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
	at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:198) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
	at org.apache.lucene.index.IndexingChain$PerField.invert(IndexingChain.java:1224) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
	at org.apache.lucene.index.IndexingChain.processField(IndexingChain.java:729) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
	at org.apache.lucene.index.IndexingChain.processDocument(IndexingChain.java:620) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
	at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:241) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
	at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:432) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
	at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1532) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
	at org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1503) ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - romseygeek - 2022-05-19 15:10:13]
	at iped.engine.task.index.IndexTask.process(IndexTask.java:148) ~[iped-engine-4.0.2.jar:?]
	at iped.engine.task.AbstractTask.processMonitorTimeout(AbstractTask.java:250) ~[iped-engine-4.0.2.jar:?]
{noformat}
This looks like an integer overflow to me, but I'm not sure. It didn't happen with
the previous Lucene 5.5.5, and indexing files like this is pretty common for us.
With Lucene 5.5.5, though, we used to break that huge file into chunks manually
before indexing and call the IndexWriter.addDocument(Document) method once per
~10MB chunk; with Lucene 9.2.0 we now use the IndexWriter.addDocuments(Iterable)
method instead. Any thoughts?
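For reference, here is a minimal sketch of the two indexing patterns, assuming a
pre-chunked list of extracted strings and illustrative field names (this is not
our actual IPED code, just the shape of the calls):
{code:java}
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;

public class ChunkIndexingSketch {

  // Old pattern (Lucene 5.5.5): each ~10MB chunk of extracted strings was
  // indexed with its own addDocument() call, as an independent document.
  static void oldPerChunk(IndexWriter writer, String parentId, List<String> chunks)
      throws Exception {
    for (String chunk : chunks) {
      Document child = new Document();
      child.add(new StringField("parentId", parentId, Field.Store.YES));
      child.add(new TextField("contents", chunk, Field.Store.NO));
      writer.addDocument(child);
    }
  }

  // Current pattern (Lucene 9.2.0): the parent metadata document and its
  // thousands of ~10MB children chunk documents are collected into one block
  // and passed to a single addDocuments(Iterable) call.
  static void newParentChildrenBlock(IndexWriter writer, Document parentMetadata,
      List<String> chunks) throws Exception {
    List<Document> block = new ArrayList<>();
    for (String chunk : chunks) {
      Document child = new Document();
      child.add(new TextField("contents", chunk, Field.Store.NO));
      block.add(child);
    }
    block.add(parentMetadata); // parent document goes last in the block
    writer.addDocuments(block);
  }
}
{code}
So the main difference after the upgrade is that all chunks of one file now go
through a single addDocuments() call instead of many independent addDocument()
calls.

And just to illustrate what I mean by integer overflow (a toy example, not Lucene
code): an int counter that keeps growing silently wraps negative once it passes
Integer.MAX_VALUE, and using a negative value as an array index throws exactly
this kind of exception:
{code:java}
public class OverflowToy {
  public static void main(String[] args) {
    int offset = Integer.MAX_VALUE; // 2147483647
    offset += 65537;                // silently wraps to a negative value (-2147418112)
    byte[] pool = new byte[71428];
    byte b = pool[offset];          // ArrayIndexOutOfBoundsException: negative index
  }
}
{code}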
> ArrayIndexOutOfBoundsException while indexing large binary file
> ---------------------------------------------------------------
>
> Key: LUCENE-10681
> URL: https://issues.apache.org/jira/browse/LUCENE-10681
> Project: Lucene - Core
> Issue Type: Bug
> Components: core/index
> Affects Versions: 9.2
> Environment: Linux Ubuntu (will check the user's version), Java x64
> version 11.0.16.1
> Reporter: Luís Filipe Nassif
> Priority: Minor
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]