[ 
https://issues.apache.org/jira/browse/LUCENE-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17581478#comment-17581478
 ] 

Luís Filipe Nassif edited comment on LUCENE-8118 at 8/18/22 6:37 PM:
---------------------------------------------------------------------

Hi, a colleague of mine pointed this issue out to me. Should I close 
https://issues.apache.org/jira/browse/LUCENE-10681 as a duplicate?

We hit this ArrayIndexOutOfBoundsException in the 640th iteration of 
addDocuments(Iterable) with docs of roughly 10 MB each. Is there a known 
upper bound for numDocs x docSize given to addDocuments()?

PS: other threads may have been indexing other documents in parallel.

PS2: our default commit interval is 30 min.

PS3: I changed our application from addDocument() to addDocuments() partly 
for its atomicity guarantees and partly because all text chunks must be 
children of one parent document. If we have to call addDocuments() multiple 
times with smaller iterables, we may have to implement the parent-child 
bookkeeping ourselves again (as we did in the past with addDocument())... 
or not?
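
As a rough way to reason about the upper-bound question above, the sketch 
below multiplies the figures from this comment (640 documents of about 10 MB) 
and compares the product with the range of a signed 32-bit int. It assumes, 
without confirmation from this thread, that all 640 documents were buffered 
by a single addDocuments() call and that the buffer involved is addressed 
with int offsets; the negative index (-65536) in the stack trace is 
consistent with that but does not prove it. The class and method names are 
hypothetical and are not part of Lucene, and raw document size is only a 
proxy for the size of the in-memory index structures.

```java
public class BatchSizeEstimate {

    // Hypothetical helper, not a Lucene API: raw bytes handed to one call.
    // Math.multiplyExact throws ArithmeticException if the long product overflows.
    static long estimatedBytes(long numDocs, long bytesPerDoc) {
        return Math.multiplyExact(numDocs, bytesPerDoc);
    }

    // True when the total cannot be represented as a non-negative int offset.
    static boolean exceedsIntRange(long totalBytes) {
        return totalBytes > Integer.MAX_VALUE;
    }

    // What a plain narrowing cast does with an oversized offset (illustrative only).
    static int narrowedOffset(long totalBytes) {
        return (int) totalBytes;
    }

    public static void main(String[] args) {
        // Figures taken from the comment: 640 docs of about 10 MB each.
        long total = estimatedBytes(640, 10L * 1024 * 1024);
        System.out.println("estimated bytes: " + total);
        System.out.println("exceeds int range: " + exceedsIntRange(total));
        System.out.println("offset after narrowing to int: " + narrowedOffset(total));
    }
}
```

If these assumptions hold, one option would be to keep numDocs x docSize per 
call well below the int range, at the cost of the block-level atomicity 
mentioned in PS3.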


> ArrayIndexOutOfBoundsException in TermsHashPerField.writeByte during indexing
> -----------------------------------------------------------------------------
>
>                 Key: LUCENE-8118
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8118
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/index
>    Affects Versions: 7.2
>         Environment: Debian/Stretch
> java version "1.8.0_144"
> Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
> Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
>            Reporter: Laura Dietz
>            Priority: Major
>         Attachments: LUCENE-8118_test.patch
>
>          Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> Indexing a large collection of about 20 million paragraph-sized documents 
> results in an ArrayIndexOutOfBoundsException in 
> org.apache.lucene.index.TermsHashPerField.writeByte (full stack trace 
> below). 
> The bug is possibly related to the issues described 
> [here|http://lucene.472066.n3.nabble.com/ArrayIndexOutOfBoundsException-65536-td3661945.html]
> and in [SOLR-10936|https://issues.apache.org/jira/browse/SOLR-10936], but I 
> am not using Solr; I am using Lucene Core directly.
> The issue can be reproduced using code from 
> [GitHub trec-car-tools-example|https://github.com/TREMA-UNH/trec-car-tools/tree/lucene-bug/trec-car-tools-example]
> - compile with `mvn compile assembly:single`
> - run with `java -cp ./target/treccar-tools-example-0.1-jar-with-dependencies.jar edu.unh.cs.TrecCarBuildLuceneIndex paragraphs paragraphCorpus.cbor indexDir`
> where paragraphCorpus.cbor is contained in this 
> [archive|http://trec-car.cs.unh.edu/datareleases/v2.0-snapshot/archive-paragraphCorpus.tar.xz]
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: -65536
>         at org.apache.lucene.index.TermsHashPerField.writeByte(TermsHashPerField.java:198)
>         at org.apache.lucene.index.TermsHashPerField.writeVInt(TermsHashPerField.java:224)
>         at org.apache.lucene.index.FreqProxTermsWriterPerField.addTerm(FreqProxTermsWriterPerField.java:159)
>         at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:185)
>         at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:786)
>         at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:430)
>         at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:392)
>         at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:281)
>         at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:451)
>         at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1532)
>         at org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1508)
>         at edu.unh.cs.TrecCarBuildLuceneIndex.main(TrecCarBuildLuceneIndex.java:55)


