[
https://issues.apache.org/jira/browse/TIKA-3002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16986381#comment-16986381
]
Hudson commented on TIKA-3002:
------------------------------
SUCCESS: Integrated in Jenkins build Tika-trunk #1754 (See
[https://builds.apache.org/job/Tika-trunk/1754/])
TIKA-3002 -- fix bug in OCR AUTO mode (tallison:
[https://github.com/apache/tika/commit/f5edbbd60ef22cce3fc2c8c23e617489d42be29f])
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/pdf/AbstractPDF2XHTML.java
> Possible bug with OCR strategy AUTO
> -----------------------------------
>
> Key: TIKA-3002
> URL: https://issues.apache.org/jira/browse/TIKA-3002
> Project: Tika
> Issue Type: Bug
> Components: ocr, parser
> Affects Versions: 1.22
> Reporter: Patrick Herber
> Priority: Major
>
> For performance reasons, I would like to run OCR only when necessary, so I
> set the OCR strategy to "AUTO".
> However, OCR is also performed on "normal" PDF files (where no OCR should be
> required). This not only slows down the application but, more importantly,
> duplicates the extracted text.
> While trying to understand how this works, I think I found a possible
> error in the class *AbstractPDF2XHTML*. When the OCR strategy is AUTO,
> line 404 checks the total number of characters found on the page: if it is
> lower than 10 (or the unmapped Unicode character count is above 10), OCR is
> performed.
> {code:java}
> } else if (config.getOcrStrategy().equals(PDFParserConfig.OCR_STRATEGY.AUTO)) {
>     //TODO add more sophistication
>     if (totalCharsPerPage < 10 || unmappedUnicodeCharsPerPage > 10) {
>         doOCROnCurrentPage();
>     }
> }
> {code}
> The logic is correct, but unfortunately at the beginning of the method (lines
> 361 and 362) the two variables checked here are reset to 0, so the
> condition is always true.
> I would suggest moving the reset of the two variables into a finally block
> at the end of the method.
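
The reset-order problem described in the report can be reproduced in isolation. The following is a minimal sketch, not the actual Tika code: the class OcrAutoSketch, the PageState helper, and the methods extractText, autoWantsOcr, endPageResetFirst, endPageResetInFinally and countOcrRuns are hypothetical names invented for illustration. Only the two counter names and the threshold check are taken from the snippet quoted in the issue.

```java
class OcrAutoSketch {

    /** Hypothetical stand-in for the per-page counters named in the issue. */
    static class PageState {
        int totalCharsPerPage;
        int unmappedUnicodeCharsPerPage;
        int ocrRuns; // number of pages that would have been sent to OCR

        /** Simulates text extraction filling the counters for one page. */
        void extractText(int totalChars, int unmappedChars) {
            totalCharsPerPage += totalChars;
            unmappedUnicodeCharsPerPage += unmappedChars;
        }

        /** The AUTO threshold check as quoted in the issue. */
        boolean autoWantsOcr() {
            return totalCharsPerPage < 10 || unmappedUnicodeCharsPerPage > 10;
        }

        /** Reported behavior: counters are zeroed before they are checked. */
        void endPageResetFirst() {
            totalCharsPerPage = 0;
            unmappedUnicodeCharsPerPage = 0;
            if (autoWantsOcr()) { // always true, because totalCharsPerPage is 0
                ocrRuns++;
            }
        }

        /** Suggested behavior: check first, reset in a finally block. */
        void endPageResetInFinally() {
            try {
                if (autoWantsOcr()) {
                    ocrRuns++;
                }
            } finally {
                // Runs even if the OCR step throws, so the next page starts clean.
                totalCharsPerPage = 0;
                unmappedUnicodeCharsPerPage = 0;
            }
        }
    }

    /** Each page is {totalChars, unmappedChars}; returns how many pages trigger OCR. */
    static int countOcrRuns(int[][] pages, boolean resetInFinally) {
        PageState state = new PageState();
        for (int[] page : pages) {
            state.extractText(page[0], page[1]);
            if (resetInFinally) {
                state.endPageResetInFinally();
            } else {
                state.endPageResetFirst();
            }
        }
        return state.ocrRuns;
    }

    public static void main(String[] args) {
        // A text-rich page, a nearly empty (scanned) page, a page with many unmapped chars.
        int[][] pages = { {500, 0}, {3, 0}, {400, 25} };
        System.out.println("reset first:      " + countOcrRuns(pages, false));
        System.out.println("reset in finally: " + countOcrRuns(pages, true));
    }
}
```

With these three sample pages, the reset-first variant triggers OCR on all three pages, while the reset-in-finally variant triggers it only on the two pages that actually meet the threshold.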
--
This message was sent by Atlassian Jira
(v8.3.4#803005)