[jira] [Commented] (TIKA-2459) Missing text in .doc file (but can be extracted by POI)

Hudson (JIRA) Fri, 08 Sep 2017 10:57:58 -0700

    [ https://issues.apache.org/jira/browse/TIKA-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16159017#comment-16159017 ]


Hudson commented on TIKA-2459:
------------------------------

SUCCESS: Integrated in Jenkins build Tika-trunk #1361 (See [https://builds.apache.org/job/Tika-trunk/1361/])
TIKA-2459 -- fix special character handling (tallison: [https://github.com/apache/tika/commit/d1a8bff9faacb828a1039f7cc2c7f9e1f1d5e3fd])
* (add) tika-parsers/src/test/resources/test-documents/testWORD_specialControlCharacter1415.doc
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/WordParserTest.java


> Missing text in .doc file (but can be extracted by POI)
> -------------------------------------------------------
>
>                 Key: TIKA-2459
>                 URL: https://issues.apache.org/jira/browse/TIKA-2459
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.16
>         Environment: Windows and Linux
>            Reporter: Dustin Spicuzza
>             Fix For: 1.17
>
>         Attachments: foo2.doc
>
>
> I have a document whose text can be extracted via 
> org.apache.poi.hwpf.converter.WordToTextConverter but is not fully 
> extracted by Tika. The 'Paragraph one' paragraph is present in POI's 
> output and missing from Tika's output.
> Tika's output:
> {noformat}
> Something
> One:
> Else
> Two:
> Here
> Three:
> Four
> Paragraph two
> Paragraph three
> Paragraph four
> cc: Somebody
>      Somebody else
> Something here too
> {noformat}
> POI's output:
> {noformat}
> Something
> One:    Else
> Two:    Here
> Three:  Four
> Paragraph one
> Paragraph two
> Paragraph three
> Paragraph four
> cc: Somebody
>      Somebody else
> Something here too
> {noformat}
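
One way to confirm which text is missing is to compare the two outputs after collapsing whitespace, since Tika and POI break the "One:/Else" style lines differently. The following is only a minimal sketch: the two strings are copied from the report above, and the names `TIKA_OUTPUT`, `POI_OUTPUT`, `normalize`, and `missing_lines` are hypothetical helpers for illustration. It is not part of Tika, POI, or the committed fix.

```python
# Outputs copied from the issue report; everything else here is illustrative.
TIKA_OUTPUT = """Something
One:
Else
Two:
Here
Three:
Four
Paragraph two
Paragraph three
Paragraph four
cc: Somebody
     Somebody else
Something here too
"""

POI_OUTPUT = """Something
One:    Else
Two:    Here
Three:  Four
Paragraph one
Paragraph two
Paragraph three
Paragraph four
cc: Somebody
     Somebody else
Something here too
"""


def normalize(text):
    """Collapse every run of whitespace (including newlines) to one space."""
    return " ".join(text.split())


def missing_lines(reference, candidate):
    """Return the non-empty lines of `reference` whose words do not occur
    as a contiguous run of whole words anywhere in `candidate`."""
    # Pad with spaces so matches are anchored on word boundaries.
    haystack = " " + normalize(candidate) + " "
    missing = []
    for line in reference.splitlines():
        needle = normalize(line)
        if needle and (" " + needle + " ") not in haystack:
            missing.append(needle)
    return missing


if __name__ == "__main__":
    print(missing_lines(POI_OUTPUT, TIKA_OUTPUT))
```

Because whitespace is collapsed before comparing, the differing line breaks around "One:", "Two:", and "Three:" are not reported as differences; only text absent from the Tika output is.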



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
