[ https://issues.apache.org/jira/browse/LUCENE-1072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless updated LUCENE-1072:
---------------------------------------
Attachment: LUCENE-1072.patch
Attached patch. I plan to commit in a day or two.
I added a unit test showing that DocumentsWriter does indeed become
unusable once it has hit a "too long term", then fixed the issue so the
unit test passes.
Now, if we encounter too-long terms in a doc, we skip those terms but
continue indexing the other acceptable terms from the doc, then throw
the IllegalArgumentException at the end, after processing the full
document. So it's now "ok" to catch & ignore this exception, though
in general you should address its root cause so you don't
accidentally pollute your term dictionary (if that does happen, see
LUCENE-1052, as Grant suggested).
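To make the new behavior concrete, here is a toy model of it in plain Java. It is only a sketch, not the code in the patch: the class TermLimitSketch, its addDocument method, and the use of a List as the "index" are all hypothetical stand-ins. The limit of 16383 is the one quoted in the exception message in this issue.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of "skip too-long terms, throw at the end"; not Lucene's actual code. */
class TermLimitSketch {
    // Limit taken from the exception message quoted in this issue.
    static final int MAX_TERM_LENGTH = 16383;

    /**
     * Adds every acceptable term to the (hypothetical) index, skips over-long
     * terms, and throws only after the whole document has been processed.
     */
    static void addDocument(List<String> terms, List<String> index) {
        int firstTooLong = -1; // length of the first rejected term, if any
        for (String term : terms) {
            if (term.length() > MAX_TERM_LENGTH) {
                if (firstTooLong < 0) {
                    firstTooLong = term.length();
                }
                continue; // skip this term, keep going with the rest
            }
            index.add(term);
        }
        if (firstTooLong >= 0) {
            throw new IllegalArgumentException("term length " + firstTooLong
                + " exceeds max term length " + MAX_TERM_LENGTH);
        }
    }

    public static void main(String[] args) {
        List<String> index = new ArrayList<>();
        String huge = new String(new char[37944]); // stand-in for a text-encoded image

        try {
            addDocument(Arrays.asList("good", huge, "also"), index);
        } catch (IllegalArgumentException e) {
            // Safe to ignore in this model: the acceptable terms were still indexed.
            System.out.println("ignored: " + e.getMessage());
        }

        // The "writer" is still usable for the next document.
        addDocument(Arrays.asList("next"), index);
        System.out.println(index); // prints [good, also, next]
    }
}
```

The point of deferring the throw is that no per-document state is left half-updated when the exception reaches the caller, so the next addDocument call starts from a clean slate.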
> NullPointerException during indexing in
> DocumentsWriter$ThreadState$FieldData.addPosition
> -----------------------------------------------------------------------------------------
>
> Key: LUCENE-1072
> URL: https://issues.apache.org/jira/browse/LUCENE-1072
> Project: Lucene - Java
> Issue Type: Bug
> Components: Index
> Affects Versions: 2.3
> Environment: Linux CentOS 5 x86_64 running on 2-core Pentium D, Java
> HotSpot(TM) 64-Bit Server VM (build 1.6.0_01-b06, mixed mode), using
> lucene-core-2007-11-29_02-49-31
> Reporter: Alexei Dets
> Assignee: Michael McCandless
> Attachments: LUCENE-1072.patch
>
>
> During indexing I sometimes encounter documents with unusually large
> "words" (in fact, text-encoded images).
> An attempt to add a document containing a field with such a token produces
> java.lang.IllegalArgumentException:
> java.lang.IllegalArgumentException: term length 37944 exceeds max term length 16383
>     at org.apache.lucene.index.DocumentsWriter$ThreadState$FieldData.addPosition(DocumentsWriter.java:1492)
>     at org.apache.lucene.index.DocumentsWriter$ThreadState$FieldData.invertField(DocumentsWriter.java:1321)
>     at org.apache.lucene.index.DocumentsWriter$ThreadState$FieldData.processField(DocumentsWriter.java:1247)
>     at org.apache.lucene.index.DocumentsWriter$ThreadState.processDocument(DocumentsWriter.java:972)
>     at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:2202)
>     at org.apache.lucene.index.DocumentsWriter.addDocument(DocumentsWriter.java:2186)
>     at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1432)
>     at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1411)
> This is expected; the exception is caught and ignored. The problem is that
> afterwards the IndexWriter is left in a somewhat corrupted state, and
> subsequent attempts to add documents to the index fail as well, this time
> with an NPE:
> java.lang.NullPointerException
>     at org.apache.lucene.index.DocumentsWriter$ThreadState$FieldData.addPosition(DocumentsWriter.java:1497)
>     at org.apache.lucene.index.DocumentsWriter$ThreadState$FieldData.invertField(DocumentsWriter.java:1321)
>     at org.apache.lucene.index.DocumentsWriter$ThreadState$FieldData.processField(DocumentsWriter.java:1247)
>     at org.apache.lucene.index.DocumentsWriter$ThreadState.processDocument(DocumentsWriter.java:972)
>     at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:2202)
>     at org.apache.lucene.index.DocumentsWriter.addDocument(DocumentsWriter.java:2186)
>     at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1432)
>     at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1411)
> This is 100% reproducible.