[jira] [Commented] (TIKA-3002) Possible bug with OCR strategy AUTO

Tilman Hausherr (Jira) Fri, 29 Nov 2019 20:43:18 -0800


    [ https://issues.apache.org/jira/browse/TIKA-3002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16985251#comment-16985251 ]


Tilman Hausherr commented on TIKA-3002:
---------------------------------------

From looking at the history, that segment was added in TIKA-2846 at the top of 
that method for a different purpose, i.e. to store and reset these values. The 
"auto" strategy was added a month later in TIKA-2749 at the bottom of that 
method.

ping [~tallison]

> Possible bug with OCR strategy AUTO
> -----------------------------------
>
>                 Key: TIKA-3002
>                 URL: https://issues.apache.org/jira/browse/TIKA-3002
>             Project: Tika
>          Issue Type: Bug
>          Components: ocr, parser
>    Affects Versions: 1.22
>            Reporter: Patrick Herber
>            Priority: Major
>
> For performance reasons, I would like to activate the OCR scanning only when 
> necessary. I therefore tried to set the OCR strategy to "AUTO".
> However, I see that OCR is also performed for "normal" PDF files (where no 
> OCR should be required). This not only slows down the application but, more 
> importantly, doubles the resulting text.
> Trying to understand how this works, I think I may have found a possible 
> error in the class *AbstractPDF2XHTML*. There, when the OCR strategy AUTO is 
> selected, line 404 checks the total number of characters found on the page: 
> if it is lower than 10, OCR is performed.
> {code:java}
> } else if (config.getOcrStrategy().equals(PDFParserConfig.OCR_STRATEGY.AUTO)) {
>     //TODO add more sophistication
>     if (totalCharsPerPage < 10 || unmappedUnicodeCharsPerPage > 10) {
>         doOCROnCurrentPage();
>     }
> }
> {code}
> The logic is correct, but unfortunately at the beginning of the method (lines 
> 361 and 362) the two variables checked in this condition are reset to 0, so 
> the condition is always true.
> I would suggest moving the reset of the two variables into a finally block 
> at the end of the method.
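
The behaviour described in the report can be reproduced in isolation. The following is only a minimal sketch, not the Tika source: the class OcrAutoSketch and the methods needsOcr, endPageResetFirst and endPageResetLast are hypothetical names, and only the two counters and the thresholds are taken from the quoted snippet. It contrasts resetting the counters before the AUTO check with resetting them in a finally block, as the reporter suggests.

```java
// Hypothetical stand-in for the per-page bookkeeping in AbstractPDF2XHTML.
// Only the counter names and the thresholds come from the quoted snippet.
class OcrAutoSketch {
    static int totalCharsPerPage;
    static int unmappedUnicodeCharsPerPage;

    // The AUTO condition quoted in the issue.
    static boolean needsOcr() {
        return totalCharsPerPage < 10 || unmappedUnicodeCharsPerPage > 10;
    }

    // Order described in the report: the counters are zeroed first,
    // so totalCharsPerPage < 10 holds for every page.
    static boolean endPageResetFirst() {
        totalCharsPerPage = 0;
        unmappedUnicodeCharsPerPage = 0;
        return needsOcr();
    }

    // Suggested order: evaluate the condition with the real counts,
    // then reset them in a finally block for the next page.
    static boolean endPageResetLast() {
        try {
            return needsOcr();
        } finally {
            totalCharsPerPage = 0;
            unmappedUnicodeCharsPerPage = 0;
        }
    }

    public static void main(String[] args) {
        // A "normal" page with plenty of extractable text.
        totalCharsPerPage = 500;
        unmappedUnicodeCharsPerPage = 0;
        System.out.println("reset first, OCR requested: " + endPageResetFirst());

        totalCharsPerPage = 500;
        unmappedUnicodeCharsPerPage = 0;
        System.out.println("reset last, OCR requested: " + endPageResetLast());
    }
}
```

In Java the return expression inside try is evaluated before the finally block runs, so endPageResetLast decides on the real counts and still leaves both counters at 0 for the next page.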



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
