Dustin Spicuzza created TIKA-2459:
-------------------------------------
Summary: Missing text in .doc file (but can be extracted by POI)
Key: TIKA-2459
URL: https://issues.apache.org/jira/browse/TIKA-2459
Project: Tika
Issue Type: Bug
Affects Versions: 1.16
Environment: Windows and Linux
Reporter: Dustin Spicuzza
Attachments: foo2.doc
I have a document (foo2.doc, attached) whose text is fully extracted by
org.apache.poi.hwpf.converter.WordToTextConverter but only partially
extracted by Tika. The 'Paragraph one' paragraph appears in POI's output
but is missing from Tika's output.
Tika's output:
{noformat}
Something
One:
Else
Two:
Here
Three:
Four
Paragraph two
Paragraph three
Paragraph four
cc: Somebody
Somebody else
Something here too
{noformat}
POI's output:
{noformat}
Something
One: Else
Two: Here
Three: Four
Paragraph one
Paragraph two
Paragraph three
Paragraph four
cc: Somebody
Somebody else
Something here too
{noformat}
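The two outputs also differ in where line breaks fall ("One:" and "Else" are on separate lines in Tika's output), so a plain line-by-line diff is noisy. Below is a minimal sketch, using only the JDK, that collapses whitespace and then lists the POI lines whose text does not occur anywhere in the Tika text. The class name, method names and constants are hypothetical, and the two strings are copied by hand from the outputs above rather than produced by running either library. The substring match is deliberately crude and is only meant to separate missing text from line-break differences.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika or POI.
class MissingTextCheck {

    // Copied by hand from "Tika's output" in this report.
    static final String TIKA_OUTPUT = String.join("\n",
            "Something", "One:", "Else", "Two:", "Here", "Three:", "Four",
            "Paragraph two", "Paragraph three", "Paragraph four",
            "cc: Somebody", "Somebody else", "Something here too");

    // Copied by hand from "POI's output" in this report.
    static final String POI_OUTPUT = String.join("\n",
            "Something", "One: Else", "Two: Here", "Three: Four",
            "Paragraph one", "Paragraph two", "Paragraph three", "Paragraph four",
            "cc: Somebody", "Somebody else", "Something here too");

    // Collapse every run of whitespace, including line breaks, into one space.
    static String normalize(String text) {
        return text.trim().replaceAll("\\s+", " ");
    }

    // Return the non-empty lines of `reference` whose normalized text
    // does not occur as a substring of the normalized `candidate`.
    static List<String> missingLines(String reference, String candidate) {
        String haystack = normalize(candidate);
        List<String> missing = new ArrayList<>();
        for (String line : reference.split("\\R")) {
            String needle = normalize(line);
            if (!needle.isEmpty() && !haystack.contains(needle)) {
                missing.add(needle);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Prints the POI lines that have no counterpart in the Tika text.
        System.out.println(missingLines(POI_OUTPUT, TIKA_OUTPUT));
    }
}
```

With the outputs pasted above, the only line reported is "Paragraph one". The other differences between the two outputs are line breaks only.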