[jira] [Updated] (TIKA-2459) Missing text in .doc file (but can be extracted by POI)

Dustin Spicuzza (JIRA) Tue, 05 Sep 2017 16:01:50 -0700

     [ https://issues.apache.org/jira/browse/TIKA-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Dustin Spicuzza updated TIKA-2459:
----------------------------------
    Attachment: foo2.doc

> Missing text in .doc file (but can be extracted by POI)
> -------------------------------------------------------
>
>                 Key: TIKA-2459
>                 URL: https://issues.apache.org/jira/browse/TIKA-2459
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.16
>         Environment: Windows and Linux
>            Reporter: Dustin Spicuzza
>         Attachments: foo2.doc
>
>
> I've got a document whose text can be extracted via 
> org.apache.poi.hwpf.converter.WordToTextConverter, but is not fully 
> extracted by Tika. The 'Paragraph one' paragraph is present in POI's 
> extraction output but missing from Tika's output.
> Tika's output:
> {noformat}
> Something
> One:
> Else
> Two:
> Here
> Three:
> Four
> Paragraph two
> Paragraph three
> Paragraph four
> cc: Somebody
>      Somebody else
> Something here too
> {noformat}
> POI's output:
> {noformat}
> Something
> One:    Else
> Two:    Here
> Three:  Four
> Paragraph one
> Paragraph two
> Paragraph three
> Paragraph four
> cc: Somebody
>      Somebody else
> Something here too
> {noformat}
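
To make the difference between the two outputs explicit, here is a minimal sketch in plain Java (no Tika or POI dependency) that lists the text segments present in one extraction but absent from the other. The class name MissingLines and its helper methods are hypothetical, introduced only for illustration. The sketch assumes that a tab or a run of two or more spaces in POI's output separates cells that Tika emits on separate lines; the whitespace in the pasted outputs may not match the real extractor output exactly.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Tika or POI.
class MissingLines {
    // Split text into trimmed, non-empty segments. A tab or a run of two or
    // more spaces is treated as a cell separator (an assumption), so
    // "One:    Else" yields "One:" and "Else".
    static List<String> segments(String text) {
        List<String> out = new ArrayList<>();
        for (String line : text.split("\\R")) {
            for (String cell : line.trim().split("\\t| {2,}")) {
                String s = cell.trim();
                if (!s.isEmpty()) {
                    out.add(s);
                }
            }
        }
        return out;
    }

    // Return the segments of `reference` that never appear in `candidate`,
    // in the order they occur in `reference`.
    static List<String> missingFrom(String reference, String candidate) {
        Set<String> seen = new HashSet<>(segments(candidate));
        List<String> missing = new ArrayList<>();
        for (String s : segments(reference)) {
            if (!seen.contains(s)) {
                missing.add(s);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // The two outputs quoted in the report above.
        String tika = String.join("\n",
            "Something", "One:", "Else", "Two:", "Here", "Three:", "Four",
            "Paragraph two", "Paragraph three", "Paragraph four",
            "cc: Somebody", "     Somebody else", "Something here too");
        String poi = String.join("\n",
            "Something", "One:    Else", "Two:    Here", "Three:  Four",
            "Paragraph one", "Paragraph two", "Paragraph three",
            "Paragraph four", "cc: Somebody", "     Somebody else",
            "Something here too");
        System.out.println(missingFrom(poi, tika));
    }
}
```

Run against the two outputs quoted in the report, the only segment flagged as missing from Tika's output is "Paragraph one"; the layout differences in the "One:", "Two:" and "Three:" rows are not reported because the cells are compared individually.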



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
