[ https://issues.apache.org/jira/browse/TIKA-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13820013#comment-13820013 ]
Nick Burch commented on TIKA-1194:
----------------------------------
I've had a quick look, and WordExtractor from Apache POI skips the text too.
My first hunch would be that it's something to do with text fields.
Any chance you could step through the parser in a debugger, checking the text
of the ranges around the point of the missing text, and see if there's anything
odd going on?
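
Since the DOCX version of the file is reported to extract correctly, one way to work out where to put the breakpoints is to compare the two extracted texts and list what the DOC output lacks. The following is only a rough sketch under assumptions: the class and method names are hypothetical (they are not part of Tika or POI), the two texts are assumed to have been extracted already and passed in as plain strings, and comparison is by exact trimmed line.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Tika or POI.
class MissingTextLocator {

    /** Splits extracted text into trimmed, non-empty lines. */
    static List<String> lines(String text) {
        List<String> out = new ArrayList<>();
        for (String line : text.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                out.add(trimmed);
            }
        }
        return out;
    }

    /**
     * Reports every line that occurs in the reference text (e.g. from the DOCX)
     * but nowhere in the suspect text (e.g. from the DOC), together with the
     * reference line that precedes it, as a hint for where to start debugging.
     */
    static List<String> missing(String suspect, String reference) {
        Set<String> seen = new HashSet<>(lines(suspect));
        List<String> ref = lines(reference);
        List<String> report = new ArrayList<>();
        for (int i = 0; i < ref.size(); i++) {
            if (!seen.contains(ref.get(i))) {
                String before = i > 0 ? ref.get(i - 1) : "<start>";
                report.add("after \"" + before + "\": \"" + ref.get(i) + "\"");
            }
        }
        return report;
    }

    public static void main(String[] args) {
        // Made-up sample strings standing in for real extractor output.
        String fromDoc = "Header\nCell A\nCell C\n";
        String fromDocx = "Header\nCell A\nCell B\nCell C\n";
        missing(fromDoc, fromDocx).forEach(System.out::println);
    }
}
```

The line preceding each missing entry gives a piece of text to watch for while stepping through the ranges in the debugger. Exact line matching is a deliberate simplification; if the two formats break lines differently, the comparison would need to be looser.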
> Missing text from MS Word (DOC) file
> ------------------------------------
>
> Key: TIKA-1194
> URL: https://issues.apache.org/jira/browse/TIKA-1194
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.4
> Reporter: Tomas Safarik
> Priority: Critical
> Attachments: OP-06-015.doc
>
>
> Hello,
> we noticed that the filtered text from some MS Word DOC files is missing one
> line (in a table cell) that is present in the original document.
> - If you add or remove one character anywhere before the problematic
> line/cell, then the filtered text is correct. If you restore the original
> text, the filtering problem returns.
> - If the file is resaved as DOCX, filtering works fine.
> I will provide a sample document. Please let me know if more information is
> needed.
> Regards,
> Tomas