[jira] [Commented] (TIKA-2582) Tesseract 4.0 includes a FF character by default, breaking parsers

ASF GitHub Bot (JIRA) Wed, 21 Feb 2018 13:24:31 -0800

    [ 
https://issues.apache.org/jira/browse/TIKA-2582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16372029#comment-16372029
 ]


ASF GitHub Bot commented on TIKA-2582:
--------------------------------------

ewanmellor opened a new pull request #222: Fix for TIKA-2582 contributed by 
ewanmellor.
URL: https://github.com/apache/tika/pull/222
 
 
   Tesseract 4.0 includes a change to use form feed characters to separate
   pages by default in its text output. Previous versions used no separator
   unless you specified the include_page_breaks option.
   
   This confuses any parser that is not expecting the FF.
   ODFParserTest.testOO2Metadata fails, because it is expecting the output of
   a blank image to be the empty string, but now the FF is there.
   
   I haven't seen any other failures, but I expect that user code will now see
   either FF or U+FFFD where it is not expected (SafeContentHandler replaces
   the FF with U+FFFD when converting text to XML).
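
To illustrate why U+FFFD shows up: form feed (U+000C) is not a legal
character in XML 1.0, so a handler that sanitizes text for XML has to replace
or drop it. The sketch below is not Tika's SafeContentHandler. The class and
method names are hypothetical, and it only mimics the replacement for the OCR
output of a blank image.

```java
// Hypothetical sketch, NOT Tika's SafeContentHandler: it mimics replacing
// characters that are illegal in XML 1.0 with U+FFFD.
class XmlSafeTextSketch {

    // True if the code point is allowed by the XML 1.0 "Char" production.
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Replaces every code point that XML 1.0 forbids with U+FFFD.
    static String sanitize(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints().forEach(cp ->
                out.appendCodePoint(isValidXmlChar(cp) ? cp : 0xFFFD));
        return out.toString();
    }

    public static void main(String[] args) {
        // Assumed OCR output for a blank image under Tesseract 4.0 defaults,
        // as described in the report: a single form feed.
        String ocrOfBlankImage = "\f";
        String sanitized = sanitize(ocrOfBlankImage);
        System.out.println(sanitized.isEmpty());        // prints false
        System.out.println(sanitized.equals("\uFFFD")); // prints true
    }
}
```

A test that expects the empty string therefore fails whether it compares
against the raw OCR text (FF) or against the sanitized text (U+FFFD).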
   
   Fix this by setting Tesseract's page_separator option to the empty string.
   This will preserve the no-page-breaks behavior with both Tesseract 3.x and
   4.0.
   
   Also, add an option TesseractOCRConfig.pageSeparator so that user code can
   request the FF or any other separator, if they want it.
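
As a rough illustration of the approach, the sketch below builds a Tesseract
command line that passes page_separator through Tesseract's "-c VAR=VALUE"
option. It is not the code from the pull request: the class name, method
name, and file names are hypothetical, and it assumes that an empty value
after "page_separator=" is what suppresses the separator.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, NOT the pull request's code: shows one way a caller
// could hand page_separator to the tesseract executable.
class TesseractCommandSketch {

    // Builds: tesseract <image> <outputBase> -c page_separator=<separator>
    // Pass "" to request no page separator, or "\f" to request a form feed.
    static List<String> buildCommand(String image, String outputBase,
                                     String pageSeparator) {
        List<String> cmd = new ArrayList<>();
        cmd.add("tesseract");
        cmd.add(image);
        cmd.add(outputBase);
        cmd.add("-c");
        cmd.add("page_separator=" + pageSeparator);
        return cmd;
    }

    public static void main(String[] args) {
        // Default proposed in this change: empty separator, no page breaks.
        System.out.println(buildCommand("scan.png", "out", ""));
    }
}
```

Passing the separator explicitly on every invocation, rather than relying on
the Tesseract default, is what keeps the output the same across versions.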

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


> Tesseract 4.0 includes a FF character by default, breaking parsers
> ------------------------------------------------------------------
>
>                 Key: TIKA-2582
>                 URL: https://issues.apache.org/jira/browse/TIKA-2582
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.17
>            Reporter: Ewan Mellor
>            Priority: Major
>
> Tesseract 4.0 includes a change to use form feed characters to separate pages 
> by default in its text output. Previous versions used no separator unless you 
> specified the include_page_breaks option.
> This confuses any parser that is not expecting the FF. 
> ODFParserTest.testOO2Metadata fails, because it is expecting the output of a 
> blank image to be the empty string, but now the FF is there.
> I haven't seen any other failures, but I expect that user code will now see 
> either FF or U+FFFD where it is not expected (SafeContentHandler replaces 
> the FF with U+FFFD when converting text to XML).
> We should set the appropriate Tesseract options to disable this behavior 
> unless explicitly requested by user code, to avoid the change in behavior.
> For reference, the Tesseract change is as follows:
> {quote}commit 2cc531e6bf0288fc8a9ad1c123a252395f00bf56
>  Merge: 3bb573ae aa6eb6bd
>  Author: zdenop <[email protected]>
>  Date: Tue Sep 19 08:41:08 2017 +0200
> Merge pull request #1140 from stweil/pagebreak
> Remove Tesseract parameter "include_page_breaks" and use FF by default
> commit aa6eb6bd466101a3b89880f87580471a7694359d
>  Author: Stefan Weil <[email protected]>
>  Date: Mon Jun 12 19:42:45 2017 +0200
> Remove Tesseract parameter "include_page_breaks" and use FF by default
> Now Tesseract adds a page break (normally form feed) by default.
> It is still possible to suppress page breaks by setting an empty
>  page_separator.
> Signed-off-by: Stefan Weil <[email protected]>
> {quote}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
