[
https://issues.apache.org/jira/browse/LUCENE-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16314869#comment-16314869
]
Robert Muir commented on LUCENE-8118:
-------------------------------------
Dawid, it is not complicated in this case. It is *trivial* to fix.
Again, to explain:
* With *addDocument* you don't hit OOM and you don't need a huge heap. Just keep
indexing documents and Lucene will flush to disk appropriately.
* With *addDocumentS* it will try to add everything you pass atomically, as
one "transaction".
There are a couple of problems here. First is the method's name: addDocuments is
*not* the plural form of addDocument, it's something totally different
altogether. It needs to be addDocumentsAtomic or addDocumentsBlock or
something else, anything else. It's also missing bounds checks, which is why you
see the AIOOBE; those need to be added.
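
To make the difference between the two calls concrete, here is a minimal Java sketch. It does not use Lucene: `Writer` and `RecordingWriter` are hypothetical stand-ins that only borrow the method names addDocument and addDocuments, and documents are plain strings. The toy writer records how many documents each call receives at once, which is the amount an atomic "transaction" would have to hold before anything could be flushed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class AddDocumentSketch {

    /** Hypothetical stand-in for the two methods discussed above; NOT Lucene's API. */
    interface Writer {
        void addDocument(String doc);
        void addDocuments(Iterable<String> docs);
    }

    /** Toy writer: records the number of documents handed over in each call. */
    static class RecordingWriter implements Writer {
        final List<Integer> callSizes = new ArrayList<>();

        @Override
        public void addDocument(String doc) {
            // One document per call, so the writer may flush between calls.
            callSizes.add(1);
        }

        @Override
        public void addDocuments(Iterable<String> docs) {
            // The whole batch arrives as one unit and must be handled together.
            int n = 0;
            for (String ignored : docs) {
                n++;
            }
            callSizes.add(n);
        }
    }

    /** Index a corpus one document at a time. */
    static void indexOneByOne(Writer writer, Iterable<String> corpus) {
        for (String doc : corpus) {
            writer.addDocument(doc);
        }
    }

    /** Index a corpus as a single atomic block. */
    static void indexAsOneBlock(Writer writer, Iterable<String> corpus) {
        writer.addDocuments(corpus);
    }

    public static void main(String[] args) {
        List<String> corpus = Arrays.asList("p1", "p2", "p3");

        RecordingWriter perDoc = new RecordingWriter();
        indexOneByOne(perDoc, corpus);
        System.out.println(perDoc.callSizes);   // prints [1, 1, 1]

        RecordingWriter block = new RecordingWriter();
        indexAsOneBlock(block, corpus);
        System.out.println(block.callSizes);    // prints [3]
    }
}
```

With three documents the difference is harmless. With the roughly 20 million paragraphs from the report, the second pattern asks the writer to treat the entire corpus as one block, while the first never hands over more than one document per call.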
> ArrayIndexOutOfBoundsException in TermsHashPerField.writeByte during indexing
> -----------------------------------------------------------------------------
>
> Key: LUCENE-8118
> URL: https://issues.apache.org/jira/browse/LUCENE-8118
> Project: Lucene - Core
> Issue Type: Bug
> Components: core/index
> Affects Versions: 7.2
> Environment: Debian/Stretch
> java version "1.8.0_144"
> Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
> Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
> Reporter: Laura Dietz
> Attachments: LUCENE-8118_test.patch
>
>
> Indexing a large collection of about 20 million paragraph-sized documents
> results in an ArrayIndexOutOfBoundsException in
> org.apache.lucene.index.TermsHashPerField.writeByte (full stack trace
> below).
> The bug is possibly related to the issues described
> [here|http://lucene.472066.n3.nabble.com/ArrayIndexOutOfBoundsException-65536-td3661945.html]
> and in [SOLR-10936|https://issues.apache.org/jira/browse/SOLR-10936], but I
> am not using Solr; I am using Lucene Core directly.
> The issue can be reproduced using code from [GitHub
> trec-car-tools-example|https://github.com/TREMA-UNH/trec-car-tools/tree/lucene-bug/trec-car-tools-example]
>
> - compile with `mvn compile assembly:single`
> - run with `java -cp ./target/treccar-tools-example-0.1-jar-with-dependencies.jar edu.unh.cs.TrecCarBuildLuceneIndex paragraphs paragraphCorpus.cbor indexDir`
> where paragraphCorpus.cbor is contained in this
> [archive|http://trec-car.cs.unh.edu/datareleases/v2.0-snapshot/archive-paragraphCorpus.tar.xz]
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: -65536
>     at org.apache.lucene.index.TermsHashPerField.writeByte(TermsHashPerField.java:198)
>     at org.apache.lucene.index.TermsHashPerField.writeVInt(TermsHashPerField.java:224)
>     at org.apache.lucene.index.FreqProxTermsWriterPerField.addTerm(FreqProxTermsWriterPerField.java:159)
>     at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:185)
>     at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:786)
>     at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:430)
>     at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:392)
>     at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:281)
>     at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:451)
>     at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1532)
>     at org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1508)
>     at edu.unh.cs.TrecCarBuildLuceneIndex.main(TrecCarBuildLuceneIndex.java:55)