[ https://issues.apache.org/jira/browse/LUCENE-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16313318#comment-16313318 ]
Robert Muir commented on LUCENE-8118:
-------------------------------------

Well, I understand the bug, but I'm not sure what the fix is. The indexing code implements Iterable etc. to pull in the docs, and makes one single call to addDocuments(). This is supposed to be an "atomic add" of multiple documents at once, which gives certain guarantees: it is needed for nested documents and features like that, so that document IDs will be aligned in a particular way.

In your case, it's too much data: IndexWriter isn't going to be able to do 200M docs in one operation like this.

> ArrayIndexOutOfBoundsException in TermsHashPerField.writeByte during indexing
> -----------------------------------------------------------------------------
>
>                 Key: LUCENE-8118
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8118
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/index
>    Affects Versions: 7.2
>         Environment: Debian/Stretch
>                      java version "1.8.0_144"
>                      Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
>                      Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
>            Reporter: Laura Dietz
>
> Indexing a large collection of about 20 million paragraph-sized documents
> results in an ArrayIndexOutOfBoundsException in
> org.apache.lucene.index.TermsHashPerField.writeByte (full stack trace below).
> The bug is possibly related to issues described in
> [here|http://lucene.472066.n3.nabble.com/ArrayIndexOutOfBoundsException-65536-td3661945.html]
> and [SOLR-10936|https://issues.apache.org/jira/browse/SOLR-10936] -- but I
> am not using SOLR, I am directly using Lucene Core.
> The issue can be reproduced using code from
> [GitHub trec-car-tools-example|https://github.com/TREMA-UNH/trec-car-tools/tree/lucene-bug/trec-car-tools-example]
> - compile with `mvn compile assembly:single`
> - run with `java -cp ./target/treccar-tools-example-0.1-jar-with-dependencies.jar edu.unh.cs.TrecCarBuildLuceneIndex paragraphs paragraphCorpus.cbor indexDir`
> Where paragraphCorpus.cbor is contained in this
> [archive|http://trec-car.cs.unh.edu/datareleases/v2.0-snapshot/archive-paragraphCorpus.tar.xz]
>
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: -65536
>     at org.apache.lucene.index.TermsHashPerField.writeByte(TermsHashPerField.java:198)
>     at org.apache.lucene.index.TermsHashPerField.writeVInt(TermsHashPerField.java:224)
>     at org.apache.lucene.index.FreqProxTermsWriterPerField.addTerm(FreqProxTermsWriterPerField.java:159)
>     at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:185)
>     at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:786)
>     at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:430)
>     at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:392)
>     at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:281)
>     at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:451)
>     at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1532)
>     at org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1508)
>     at edu.unh.cs.TrecCarBuildLuceneIndex.main(TrecCarBuildLuceneIndex.java:55)
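
The comment in this message says that a single addDocuments() call is treated as one atomic add, and that the whole collection is too much for one such operation. One possible way around that, sketched below, is to hand the documents over in bounded batches rather than as one Iterable. This is only a sketch under assumptions: it assumes the paragraphs are independent documents that do not need the block guarantees (the thread does not confirm this), and the names BoundedBatches, feedInBatches and maxBatch are hypothetical. The sink is a stand-in for a call such as IndexWriter.addDocuments(batch), or a loop over IndexWriter.addDocument(doc); since those declare IOException, a real sink would have to wrap the checked exception.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper, not part of Lucene: drains an iterator in bounded batches.
final class BoundedBatches {

    /**
     * Reads items from source and passes them to sink in lists of at most
     * maxBatch items. Returns the number of batches handed to the sink.
     * In an indexing program the sink would forward each batch to the writer
     * (assumption: documents are independent, so batch boundaries do not matter).
     */
    static <T> int feedInBatches(Iterator<T> source, int maxBatch, Consumer<List<T>> sink) {
        if (maxBatch <= 0) {
            throw new IllegalArgumentException("maxBatch must be positive");
        }
        int batches = 0;
        List<T> batch = new ArrayList<>(maxBatch);
        while (source.hasNext()) {
            batch.add(source.next());
            if (batch.size() == maxBatch) {
                sink.accept(batch);
                batches++;
                // Start a fresh list so the sink may keep the one it received.
                batch = new ArrayList<>(maxBatch);
            }
        }
        // Flush the final, possibly smaller, batch.
        if (!batch.isEmpty()) {
            sink.accept(batch);
            batches++;
        }
        return batches;
    }
}
```

With a batch size of 1 this degenerates to adding one document per call, which is likely the simplest choice when no nested-document alignment is required.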