[jira] [Commented] (TIKA-1194) Missing text from MS Word (DOC) file

Nick Burch (JIRA) Tue, 12 Nov 2013 03:08:44 -0800

    [ https://issues.apache.org/jira/browse/TIKA-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13820013#comment-13820013 ]


Nick Burch commented on TIKA-1194:
----------------------------------

I've had a quick look, and WordExtractor from Apache POI skips the text too.

My first hunch is that it's something to do with text fields.

Any chance you could step through the parser in a debugger, checking the text 
of the ranges around the point of the missing text, and see if there's anything 
odd going on?
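
When inspecting range text in a debugger, field marks and cell marks are 
invisible control characters, so they are easy to miss. The helper below is 
only a sketch to make them visible; the class and method names 
(RangeTextDebug, visualize) are made up for illustration and are not part of 
Tika or POI. It assumes you already have the raw text of a range as a String 
(for example from POI's Range.text()) and relies on the Word binary format 
conventions of 0x13/0x14/0x15 for field begin/separator/end and 0x07 for a 
table cell mark.

```java
// Hypothetical debugging helper, not part of Tika or POI.
final class RangeTextDebug {
    private RangeTextDebug() {}

    /** Returns a copy of rawText with Word control characters spelled out. */
    static String visualize(String rawText) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < rawText.length(); i++) {
            char c = rawText.charAt(i);
            switch (c) {
                case 0x13: out.append("[FIELD_BEGIN]"); break; // field begin mark
                case 0x14: out.append("[FIELD_SEP]"); break;   // field separator mark
                case 0x15: out.append("[FIELD_END]"); break;   // field end mark
                case 0x07: out.append("[CELL]"); break;        // table cell / row mark
                case '\r': out.append("[PARA]"); break;        // paragraph mark
                default:
                    if (c < 0x20) {
                        // Any other control character: show its code point.
                        out.append(String.format("<0x%02X>", (int) c));
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.toString();
    }
}
```

Evaluating RangeTextDebug.visualize(text) as a watch expression on each 
range around the missing cell should show whether the text sits between 
unbalanced or unexpected field marks.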

> Missing text from MS Word (DOC) file
> ------------------------------------
>
>                 Key: TIKA-1194
>                 URL: https://issues.apache.org/jira/browse/TIKA-1194
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.4
>            Reporter: Tomas Safarik
>            Priority: Critical
>         Attachments: OP-06-015.doc
>
>
> Hello,
> we noticed that the filtered text from some MS Word DOC files is missing one 
> line (in a table cell) that is present in the original document.
> - If you add or remove one character anywhere before the problematic 
> line/cell, then the filtered text is correct. If you revert the text to the 
> original, the filtering problem returns.
> - If the file is resaved as DOCX, filtering works fine.
> I will provide a sample document. Please let me know if more information is 
> needed.
> Regards,
> Tomas



--
This message was sent by Atlassian JIRA
(v6.1#6144)
