[ https://issues.apache.org/jira/browse/LUCENE-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16314698#comment-16314698 ]
Dawid Weiss commented on LUCENE-8118:
-------------------------------------
OOMs are complicated in general because once you hit one, there's a very real
risk that you won't be able to recover anyway (even constructing a new
exception message typically requires memory allocation, and this goes on
and on in a vicious cycle). I remember thinking about it a lot in the early
days of randomizedrunner, but without reaching any constructive conclusions. I
tried preallocating things in advance (not possible in all cases) and
workarounds like keeping a memory buffer that is made reclaimable on OOM (so
that there's some memory available before we hit the next one). These are hacks
more than solutions, and they don't always work anyway (for example, when you
have background threads competing for the heap).
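
For illustration only, the "reclaimable memory buffer" workaround might look
roughly like the sketch below. The class MemoryReserve and its methods are
hypothetical names made up for this example; they are not taken from
randomizedrunner or Lucene. The sketch assumes the recovery handler is
allocated before the OOM occurs, and it offers no protection when other threads
grab the freed heap first.

```java
import java.util.function.Supplier;

/** Hypothetical helper: holds a spare block of heap that is dropped when an OOM is caught. */
public final class MemoryReserve {
    // Spare block kept alive only so that it can be released later.
    private volatile byte[] reserve;

    public MemoryReserve(int reserveBytes) {
        this.reserve = new byte[reserveBytes];
    }

    /** True while the spare block is still held. */
    public boolean isHeld() {
        return reserve != null;
    }

    /**
     * Runs the task. If it throws OutOfMemoryError, the reserve is dropped so
     * that the handler has a chance of finding some free heap, then the
     * handler is called. Both suppliers should be created up front, since
     * allocating them after the OOM may itself fail.
     */
    public <T> T run(Supplier<T> task, Supplier<T> onOom) {
        try {
            return task.get();
        } catch (OutOfMemoryError e) {
            reserve = null; // make the spare block reclaimable by the GC
            return onOom.get();
        }
    }
}
```

A caller would create the reserve once at startup, for example
`new MemoryReserve(1 << 20)`, and wrap the risky work in `run(...)`. Once the
reserve has been released it is gone, so a second OOM gets no such cushion.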
I like Java, but it starts to show its wrinkles. :(
> ArrayIndexOutOfBoundsException in TermsHashPerField.writeByte during indexing
> -----------------------------------------------------------------------------
>
> Key: LUCENE-8118
> URL: https://issues.apache.org/jira/browse/LUCENE-8118
> Project: Lucene - Core
> Issue Type: Bug
> Components: core/index
> Affects Versions: 7.2
> Environment: Debian/Stretch
> java version "1.8.0_144"
> Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
> Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
> Reporter: Laura Dietz
> Attachments: LUCENE-8118_test.patch
>
>
> Indexing a large collection of about 20 million paragraph-sized documents
> results in an ArrayIndexOutOfBoundsException in
> org.apache.lucene.index.TermsHashPerField.writeByte (full stack trace
> below).
> The bug is possibly related to the issues described
> [here|http://lucene.472066.n3.nabble.com/ArrayIndexOutOfBoundsException-65536-td3661945.html]
> and in [SOLR-10936|https://issues.apache.org/jira/browse/SOLR-10936], but I
> am not using SOLR; I am using Lucene Core directly.
> The issue can be reproduced using code from [GitHub
> trec-car-tools-example|https://github.com/TREMA-UNH/trec-car-tools/tree/lucene-bug/trec-car-tools-example]
>
> - compile with `mvn compile assembly:single`
> - run with `java -cp ./target/treccar-tools-example-0.1-jar-with-dependencies.jar edu.unh.cs.TrecCarBuildLuceneIndex paragraphs paragraphCorpus.cbor indexDir`
> Where paragraphCorpus.cbor is contained in this
> [archive|http://trec-car.cs.unh.edu/datareleases/v2.0-snapshot/archive-paragraphCorpus.tar.xz]
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: -65536
>     at org.apache.lucene.index.TermsHashPerField.writeByte(TermsHashPerField.java:198)
>     at org.apache.lucene.index.TermsHashPerField.writeVInt(TermsHashPerField.java:224)
>     at org.apache.lucene.index.FreqProxTermsWriterPerField.addTerm(FreqProxTermsWriterPerField.java:159)
>     at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:185)
>     at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:786)
>     at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:430)
>     at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:392)
>     at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:281)
>     at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:451)
>     at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1532)
>     at org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1508)
>     at edu.unh.cs.TrecCarBuildLuceneIndex.main(TrecCarBuildLuceneIndex.java:55)