[ https://issues.apache.org/jira/browse/LUCENE-9037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16970622#comment-16970622 ]

Ilan Ginzburg edited comment on LUCENE-9037 at 11/8/19 10:47 PM:
-----------------------------------------------------------------

Thanks [~mikemccand].

What about moving up the call to 
{{DocumentsWriterFlushControl.doAfterDocument()}} into the {{finally}} of the 
block calling {{DocumentsWriterPerThread.updateDocument/s()}} in 
{{DocumentsWriter.updateDocument/s()}}?
Basically, consider {{DocumentsWriterFlushControl.doAfterDocument()}} as a "do 
after _successful or failed_ document".

I'm exploring that path to see if I can make it work (and keep the existing tests passing).
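
A rough, self-contained sketch of the shape I have in mind, with hypothetical stand-in types rather than the real {{DocumentsWriter}} internals:

{code:java}
import java.io.IOException;

// Simplified sketch of the suggestion above. The stand-in types are
// hypothetical; the point is only that doAfterDocument() moves into a
// finally block, so it runs after successful AND failed documents.
class FinallySketch {

  static class PerThread { // stands in for DocumentsWriterPerThread
    long updateDocument(String doc) throws IOException {
      if (doc == null) {
        throw new IOException("simulated tokenization failure");
      }
      return doc.length(); // pretend sequence number
    }
  }

  static class FlushControl { // stands in for DocumentsWriterFlushControl
    void doAfterDocument(PerThread dwpt) {
      // in the real code: update RAM accounting, pick writers to flush, etc.
      System.out.println("doAfterDocument ran");
    }
  }

  final PerThread dwpt = new PerThread();
  final FlushControl flushControl = new FlushControl();

  long updateDocument(String doc) throws IOException {
    try {
      return dwpt.updateDocument(doc);
    } finally {
      // "do after successful or failed document": runs even when
      // updateDocument() threw, so the failed doc is still accounted for
      flushControl.doAfterDocument(dwpt);
    }
  }

  public static void main(String[] args) {
    FinallySketch w = new FinallySketch();
    try {
      w.updateDocument(null); // simulated failing doc
    } catch (IOException expected) {
      // doAfterDocument still ran above
    }
  }
}
{code}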

Your suggestion of throwing a meaningful exception upon reaching the limit 
would not help my use case if there's no flush happening as a consequence.


> ArrayIndexOutOfBoundsException due to repeated IOException during indexing
> --------------------------------------------------------------------------
>
>                 Key: LUCENE-9037
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9037
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/index
>    Affects Versions: 7.1
>            Reporter: Ilan Ginzburg
>            Priority: Minor
>         Attachments: TestIndexWriterTermsHashOverflow.java
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> There is a limit to the number of tokens that Lucene can hold in memory 
> when docs are indexed via DocumentsWriter; once it is exceeded, bad things 
> happen. The limit can be reached by submitting a really large document, by 
> submitting a large number of documents without doing a commit (see 
> LUCENE-8118), or by repeatedly submitting documents that fail to get 
> indexed in some specific ways, leading to Lucene not cleaning up the 
> in-memory data structures that eventually overflow.
> The overflow is due to a 32-bit (signed) integer wrapping around into 
> negative territory, which then causes an ArrayIndexOutOfBoundsException.
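> For illustration only (this is not Lucene code), the wraparound mechanic 
> looks like this: a signed 32-bit offset that keeps growing eventually goes 
> negative and is then used as an array index:
> {code:java}
> public class WraparoundDemo {
>   public static void main(String[] args) {
>     int offset = Integer.MAX_VALUE; // 2147483647
>     offset += 65536;                // silently wraps to -2147418113
>     byte[] pool = new byte[16];
>     // indexing with the wrapped value throws
>     // java.lang.ArrayIndexOutOfBoundsException with a negative index
>     System.out.println(pool[offset]);
>   }
> }
> {code}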
> The failure path that we are reliably hitting is an IOException during doc 
> tokenization: a tokenizer implementing TokenStream throws from 
> incrementToken(), which causes indexing of that doc to fail.
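> A hedged sketch of such a tokenizer (a hypothetical class, not the 
> attached TestIndexWriterTermsHashOverflow):
> {code:java}
> import java.io.IOException;
> import org.apache.lucene.analysis.TokenStream;
>
> // Hypothetical sketch: a TokenStream that fails partway through a
> // document, so indexing of the doc dies with a plain IOException
> // rather than an aborting exception.
> final class FailingTokenStream extends TokenStream {
>   private int calls = 0;
>
>   @Override
>   public boolean incrementToken() throws IOException {
>     if (++calls > 3) {
>       throw new IOException("simulated tokenization failure");
>     }
>     // a real tokenizer would also fill CharTermAttribute etc. here
>     return true;
>   }
> }
> {code}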
> The IOException bubbles back up to DocumentsWriter.updateDocument() (or 
> DocumentsWriter.updateDocuments() in some other cases), where it is not 
> treated as an AbortingException and therefore does not cause a reset of 
> the DocumentsWriterPerThread. On repeated failures (without any successful 
> indexing in between), if the upper layer (a client via Solr) resubmits the 
> doc and it fails again, the DocumentsWriterPerThread's TermsHashPerField 
> data structures eventually grow and overflow, leading to an exception 
> stack similar to the one in LUCENE-8118 (stack trace below copied from a 
> test-run repro on 7.1):
> java.lang.ArrayIndexOutOfBoundsException: -65536
>   at __randomizedtesting.SeedInfo.seed([394FAB2B91B1D90A:C86FB3F3CE001AA8]:0)
>   at org.apache.lucene.index.TermsHashPerField.writeByte(TermsHashPerField.java:198)
>   at org.apache.lucene.index.TermsHashPerField.writeVInt(TermsHashPerField.java:221)
>   at org.apache.lucene.index.FreqProxTermsWriterPerField.writeProx(FreqProxTermsWriterPerField.java:80)
>   at org.apache.lucene.index.FreqProxTermsWriterPerField.addTerm(FreqProxTermsWriterPerField.java:171)
>   at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:185)
>   at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:792)
>   at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:430)
>   at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:392)
>   at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:239)
>   at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:481)
>   at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1717)
>   at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1462)
> Using tokens composed only of lowercase letters, it takes fewer than 
> 130,000,000 distinct tokens (the shortest ones) to overflow 
> TermsHashPerField.
> Using a single document (composed of the 20,000 shortest lowercase tokens) 
> submitted repeatedly for indexing, it takes 6352 submissions, all failing 
> with an IOException on incrementToken(), to trigger the 
> ArrayIndexOutOfBoundsException (6352 x 20,000 = 127,040,000 token 
> occurrences, consistent with the figure above).
> A proposed fix is to treat an IOException in DocumentsWriter.updateDocument() 
> and DocumentsWriter.updateDocuments() the same way we treat an 
> AbortingException.
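> In sketch form (simplified, hypothetical stand-in types, not a patch), 
> the proposed handling would look like:
> {code:java}
> import java.io.IOException;
>
> // Simplified sketch of the proposed fix: an IOException from the
> // per-thread writer takes the same abort path as an AbortingException,
> // resetting the per-thread state so TermsHashPerField cannot keep
> // growing across repeated failures.
> class AbortOnIOExceptionSketch {
>
>   static class PerThread { // stands in for DocumentsWriterPerThread
>     boolean aborted = false;
>
>     void updateDocument(String doc) throws IOException {
>       throw new IOException("incrementToken() failed"); // simulated failure
>     }
>
>     void abort() {
>       aborted = true; // the real code discards the in-memory buffers
>     }
>   }
>
>   static void updateDocument(PerThread dwpt, String doc) throws IOException {
>     try {
>       dwpt.updateDocument(doc);
>     } catch (IOException ex) {
>       dwpt.abort(); // treat the IOException like an AbortingException
>       throw ex;
>     }
>   }
>
>   public static void main(String[] args) {
>     PerThread dwpt = new PerThread();
>     try {
>       updateDocument(dwpt, "doc");
>     } catch (IOException expected) {
>       System.out.println("aborted = " + dwpt.aborted); // true
>     }
>   }
> }
> {code}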


