[ 
https://issues.apache.org/jira/browse/TIKA-2582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16419781#comment-16419781
 ] 

Hudson commented on TIKA-2582:
------------------------------

SUCCESS: Integrated in Jenkins build Tika-trunk #1463 (See [https://builds.apache.org/job/Tika-trunk/1463/])
Fix for TIKA-2582 contributed by ewanmellor. (commits: [https://github.com/apache/tika/commit/65defb20301d40397e94076a4b2011688cb94637])
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRConfig.java


> Tesseract 4.0 includes a FF character by default, breaking parsers
> ------------------------------------------------------------------
>
>                 Key: TIKA-2582
>                 URL: https://issues.apache.org/jira/browse/TIKA-2582
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.17
>            Reporter: Ewan Mellor
>            Priority: Major
>             Fix For: 1.18, 2.0.0
>
>
> Tesseract 4.0 changes its text output to separate pages with form feed (FF) 
> characters by default. Previous versions used no separator unless you 
> specified the include_page_breaks option.
> This confuses any parser that is not expecting the FF. 
> ODFParserTest.testOO2Metadata fails because it expects the output for a 
> blank image to be the empty string, but now the FF is there.
> I haven't seen any other failures, but I expect that user code will now see 
> either FF or U+FFFD where it is not expecting it (SafeContentHandler 
> replaces the FF with U+FFFD when converting text to XML).
> We should set the appropriate Tesseract options to disable this behavior 
> unless user code explicitly requests it, to avoid the change in behavior.
> For reference, the Tesseract change is as follows:
> {quote}commit 2cc531e6bf0288fc8a9ad1c123a252395f00bf56
>  Merge: 3bb573ae aa6eb6bd
>  Author: zdenop <[email protected]>
>  Date: Tue Sep 19 08:41:08 2017 +0200
>
> Merge pull request #1140 from stweil/pagebreak
>
> Remove Tesseract parameter "include_page_breaks" and use FF by default
>
> commit aa6eb6bd466101a3b89880f87580471a7694359d
>  Author: Stefan Weil <[email protected]>
>  Date: Mon Jun 12 19:42:45 2017 +0200
>
> Remove Tesseract parameter "include_page_breaks" and use FF by default
>
> Now Tesseract adds a page break (normally form feed) by default.
> It is still possible to suppress page breaks by setting an empty
>  page_separator.
>
> Signed-off-by: Stefan Weil <[email protected]>
> {quote}
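
The quoted Tesseract commit notes that page breaks can be suppressed by setting an empty page_separator. As a rough illustration only (this is not the code from the Tika commit), a caller that shells out to Tesseract might pass that setting via the -c option and, as a fallback, strip any FF already present in the output. The class PageSeparatorSketch and its method names are hypothetical, and whether older Tesseract releases accept the page_separator parameter is not covered here.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical helper, not part of Tika: sketches how a caller might
 * suppress the Tesseract 4.0 page separator.
 */
class PageSeparatorSketch {

    /** Form feed (U+000C), the default page separator in Tesseract 4.0. */
    static final char FORM_FEED = '\f';

    /**
     * Builds "tesseract <image> <outputBase> -c page_separator=<separator>".
     * Per the Tesseract commit message, an empty separator suppresses page breaks.
     */
    static List<String> buildCommand(String image, String outputBase, String separator) {
        List<String> cmd = new ArrayList<>();
        cmd.add("tesseract");
        cmd.add(image);
        cmd.add(outputBase);
        cmd.add("-c");
        cmd.add("page_separator=" + separator);
        return cmd;
    }

    /** Fallback: removes form feeds from text produced with the FF default. */
    static String stripFormFeeds(String ocrText) {
        return ocrText.replace(String.valueOf(FORM_FEED), "");
    }

    public static void main(String[] args) {
        // Show the argument list for a blank image with page breaks suppressed.
        System.out.println(String.join(" ", buildCommand("blank.png", "out", "")));
        // A blank page that produced only an FF becomes the empty string again.
        System.out.println(stripFormFeeds("\f").isEmpty());
    }
}
```

Setting the option at the command line keeps the old behavior at the source, while the stripping step only helps with text that was already generated with the new default.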



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
