[jira] [Commented] (TIKA-2459) Missing text in .doc file (but can be extracted by POI)

Tim Allison (JIRA) Fri, 08 Sep 2017 11:10:39 -0700

    [ https://issues.apache.org/jira/browse/TIKA-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16159033#comment-16159033 ]


Tim Allison commented on TIKA-2459:
-----------------------------------

It looks like Tika's {{handleSpecialCharacterRuns(...)}} dates back roughly 
seven years.  [~gagravarr] or others, any idea why we don't use 
{{Range.stripFields()}} from POI for this?
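
For context, here is a minimal sketch of what stripping field markers from raw Word text involves. It assumes the usual binary .doc field layout (0x13 begins a field, 0x14 separates the field instruction from its result, 0x15 ends the field). The class name {{FieldStripSketch}} and its method are hypothetical and written only for illustration; this is neither POI's {{Range.stripFields()}} implementation nor Tika's {{handleSpecialCharacterRuns(...)}}, and it does not show that fields are the cause of the missing paragraph.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical illustration only; not POI's or Tika's actual code.
final class FieldStripSketch {
    static final char FIELD_BEGIN = '\u0013';     // start of a field
    static final char FIELD_SEPARATOR = '\u0014'; // instruction | result
    static final char FIELD_END = '\u0015';       // end of a field

    /** Drops field instructions and markers, keeping visible field results. */
    static String stripFields(String text) {
        StringBuilder out = new StringBuilder(text.length());
        // One entry per open field: TRUE while still inside its instruction part.
        Deque<Boolean> inInstruction = new ArrayDeque<>();
        int hidden = 0; // count of open fields currently in their instruction part
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == FIELD_BEGIN) {
                inInstruction.push(Boolean.TRUE);
                hidden++;
            } else if (c == FIELD_SEPARATOR && !inInstruction.isEmpty()) {
                // Switch the innermost field from instruction to result.
                if (inInstruction.pop()) hidden--;
                inInstruction.push(Boolean.FALSE);
            } else if (c == FIELD_END && !inInstruction.isEmpty()) {
                if (inInstruction.pop()) hidden--;
            } else if (hidden == 0) {
                // Plain text, or a result not nested inside any instruction.
                // Unmatched separator/end markers are passed through unchanged.
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints "Page 2 of 5"
        System.out.println(stripFields("Page \u0013 PAGE \u00142\u0015 of 5"));
    }
}
```

The sketch keeps a field's result and discards its instruction, so text that lives only in a field result survives extraction. Nested fields are tracked with a stack so that a result inside an outer field's instruction stays hidden.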

> Missing text in .doc file (but can be extracted by POI)
> -------------------------------------------------------
>
>                 Key: TIKA-2459
>                 URL: https://issues.apache.org/jira/browse/TIKA-2459
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.16
>         Environment: Windows and Linux
>            Reporter: Dustin Spicuzza
>             Fix For: 1.17
>
>         Attachments: foo2.doc
>
>
> I've got a document whose text can be extracted via 
> org.apache.poi.hwpf.converter.WordToTextConverter, but is not fully 
> extracted by Tika. The 'Paragraph one' paragraph is present in the POI 
> extraction output but missing from Tika's output.
> Tika's output:
> {noformat}
> Something
> One:
> Else
> Two:
> Here
> Three:
> Four
> Paragraph two
> Paragraph three
> Paragraph four
> cc: Somebody
>      Somebody else
> Something here too
> {noformat}
> POI's output:
> {noformat}
> Something
> One:    Else
> Two:    Here
> Three:  Four
> Paragraph one
> Paragraph two
> Paragraph three
> Paragraph four
> cc: Somebody
>      Somebody else
> Something here too
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
