[
https://issues.apache.org/jira/browse/TIKA-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dustin Spicuzza updated TIKA-2459:
----------------------------------
Attachment: foo2.doc
> Missing text in .doc file (but can be extracted by POI)
> -------------------------------------------------------
>
> Key: TIKA-2459
> URL: https://issues.apache.org/jira/browse/TIKA-2459
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.16
> Environment: Windows and Linux
> Reporter: Dustin Spicuzza
> Attachments: foo2.doc
>
>
> I have a document whose text can be fully extracted with
> org.apache.poi.hwpf.converter.WordToTextConverter, but Tika extracts only
> part of it. The 'Paragraph one' paragraph appears in the POI extraction
> output and is missing from Tika's output.
> Tika's output:
> {noformat}
> Something
> One:
> Else
> Two:
> Here
> Three:
> Four
> Paragraph two
> Paragraph three
> Paragraph four
> cc: Somebody
> Somebody else
> Something here too
> {noformat}
> POI's output:
> {noformat}
> Something
> One: Else
> Two: Here
> Three: Four
> Paragraph one
> Paragraph two
> Paragraph three
> Paragraph four
> cc: Somebody
> Somebody else
> Something here too
> {noformat}
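The two outputs above also differ in line breaks (Tika emits "One:" and "Else" on separate lines), so a plain line-by-line diff would flag more than the missing paragraph. A minimal sketch of a whitespace-insensitive check follows. It uses only the Java standard library and the two outputs quoted above as literal strings; the class name MissingTextCheck and its helper methods are hypothetical and are not part of Tika or POI.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika or POI: reports which lines of a
// reference extraction are absent from another extraction, ignoring
// differences in line breaks and spacing.
public class MissingTextCheck {

    // Collapse every run of whitespace (including line breaks) to one space.
    static String normalize(String text) {
        return text.trim().replaceAll("\\s+", " ");
    }

    // Return the lines of `reference` whose normalized text does not occur,
    // on word boundaries, anywhere in the normalized `candidate` text.
    static List<String> missingLines(String reference, String candidate) {
        String haystack = " " + normalize(candidate) + " ";
        List<String> missing = new ArrayList<>();
        for (String line : reference.split("\\R")) {
            String needle = normalize(line);
            if (!needle.isEmpty() && !haystack.contains(" " + needle + " ")) {
                missing.add(needle);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Tika's output as quoted in the report.
        String tika = String.join("\n",
            "Something", "One:", "Else", "Two:", "Here", "Three:", "Four",
            "Paragraph two", "Paragraph three", "Paragraph four",
            "cc: Somebody", "Somebody else", "Something here too");
        // POI's output as quoted in the report.
        String poi = String.join("\n",
            "Something", "One: Else", "Two: Here", "Three: Four",
            "Paragraph one", "Paragraph two", "Paragraph three",
            "Paragraph four", "cc: Somebody", "Somebody else",
            "Something here too");

        System.out.println(missingLines(poi, tika)); // prints [Paragraph one]
    }
}
```

With the quoted outputs, only "Paragraph one" is reported, which matches the description: the "One: Else" style lines are present in Tika's output and merely split across lines.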