[ https://issues.apache.org/jira/browse/TIKA-1822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15083128#comment-15083128 ]
Tim Allison commented on TIKA-1822:
-----------------------------------
When we can't get the ID for a linked object via POI's {{CharacterRun mscr =
field.getMarkSeparatorCharacterRun(r);}}, should we add an annotation with a
placeholder ID (e.g. {{<div class="embedded" id="_UNKNOWN_ID" />}}), or should we
skip the annotation entirely?
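
To make the two options concrete, here is a minimal, self-contained sketch. It is not the actual WordExtractor code: {{FakeRun}} is a hypothetical stand-in for POI's CharacterRun, the way a real ID is derived from the run is not shown, and I am assuming the failing case is simply a null result from the lookup.

```java
import java.util.Optional;

class EmbeddedIdSketch {

    /** Hypothetical stand-in for POI's CharacterRun; it only carries an id. */
    static final class FakeRun {
        final String id;

        FakeRun(String id) {
            this.id = id;
        }
    }

    /** Placeholder id taken from the example in the comment above. */
    static final String UNKNOWN_ID = "_UNKNOWN_ID";

    /** Option 1: always emit a div, falling back to the placeholder id. */
    static String annotateWithPlaceholder(FakeRun mscr) {
        String id = (mscr == null) ? UNKNOWN_ID : mscr.id;
        return "<div class=\"embedded\" id=\"" + id + "\" />";
    }

    /** Option 2: emit a div only when an id is available, otherwise skip. */
    static Optional<String> annotateOrSkip(FakeRun mscr) {
        if (mscr == null) {
            return Optional.empty();
        }
        return Optional.of("<div class=\"embedded\" id=\"" + mscr.id + "\" />");
    }

    public static void main(String[] args) {
        FakeRun missing = null;               // assumed failing case: no run found
        FakeRun found = new FakeRun("_1234"); // illustrative id only

        System.out.println(annotateWithPlaceholder(missing));
        System.out.println(annotateOrSkip(missing).isPresent());
        System.out.println(annotateWithPlaceholder(found));
    }
}
```

With option 1, downstream consumers still see that an embedded object was referenced at that position, but several objects could share the same placeholder ID. With option 2, the output contains only resolvable IDs, at the cost of silently dropping the reference.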
> NullPointerException when parsing a .doc file
> ---------------------------------------------
>
> Key: TIKA-1822
> URL: https://issues.apache.org/jira/browse/TIKA-1822
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.8
> Environment: Linux
> Reporter: Panagiotis Mpailis
> Assignee: Tim Allison
> Attachments: npe_example.doc
>
>
> We are using Tika 1.11 to extract text from MS Word documents, and a few
> errors occur when processing some of them.
> This ticket relates to https://issues.apache.org/jira/browse/TIKA-1733;
> however, in this case there is an unexpected NullPointerException rather than
> a clear indication of the error.
> Processing a saved copy of the document avoids the error altogether. One
> difference found between the two documents was that
> _(HWPFDocument)document.getRange()_ returned different values.
> {noformat}
> Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@58a306e2
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> at org.apache.tika.Tika.parseToString(Tika.java:496)
> at org.apache.tika.Tika.parseToString(Tika.java:610)
> Caused by: java.lang.NullPointerException
> at org.apache.tika.parser.microsoft.WordExtractor.handleParagraph(WordExtractor.java:311)
> at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:169)
> at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:146)
> at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> ... 10 more
> {noformat}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)