[ 
https://issues.apache.org/jira/browse/TIKA-3002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Herber updated TIKA-3002:
---------------------------------
    Description: 
For performance reasons, I would like to activate the OCR scanning only when 
necessary. I therefore tried to set the OCR strategy to "AUTO".

However, I see that OCR is also performed for "normal" PDF files (where no OCR 
should be required). This not only slows down the application but, more 
importantly, doubles the resulting text.

Trying to understand how this works, I think I may have found a possible error 
in the class *AbstractPDF2XHTML*. There, when the OCR strategy AUTO is 
selected, line 404 checks the total number of characters found on the page: if 
this is lower than 10, OCR is performed.
{code:java}
} else if (config.getOcrStrategy().equals(PDFParserConfig.OCR_STRATEGY.AUTO)) {
    //TODO add more sophistication
    if (totalCharsPerPage < 10 || unmappedUnicodeCharsPerPage > 10) {
        doOCROnCurrentPage();
    }
}
{code}
The logic is correct, but unfortunately, at the beginning of the method (lines 
361 and 362), the two variables checked here are reset to 0, so this condition 
is always true.

I would suggest moving the reset of the two variables into a finally block at 
the end of the method.
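
To illustrate the reported ordering problem and the suggested change, here is a 
minimal, self-contained sketch. It is not the Tika source: the class and method 
names (OcrAutoSketch, PageCounters, shouldOcr, endPageResetFirst, 
endPageResetInFinally) are hypothetical, and the reset-before-check ordering is 
taken from the description above rather than verified against the code. Only 
the two thresholds come from the snippet quoted in this issue.

```java
// Hypothetical stand-in for the per-page state; not a Tika class.
final class OcrAutoSketch {

    static final class PageCounters {
        int totalCharsPerPage;
        int unmappedUnicodeCharsPerPage;
    }

    // Same condition as the AUTO branch quoted in this issue.
    static boolean shouldOcr(int totalChars, int unmappedChars) {
        return totalChars < 10 || unmappedChars > 10;
    }

    // Ordering as described in the report: counters are reset to 0
    // before the check, so "totalCharsPerPage < 10" always holds.
    static boolean endPageResetFirst(PageCounters c) {
        c.totalCharsPerPage = 0;
        c.unmappedUnicodeCharsPerPage = 0;
        return shouldOcr(c.totalCharsPerPage, c.unmappedUnicodeCharsPerPage);
    }

    // Suggested ordering: check first, reset in a finally block so the
    // counters are cleared for the next page even if the check throws.
    static boolean endPageResetInFinally(PageCounters c) {
        try {
            return shouldOcr(c.totalCharsPerPage, c.unmappedUnicodeCharsPerPage);
        } finally {
            c.totalCharsPerPage = 0;
            c.unmappedUnicodeCharsPerPage = 0;
        }
    }

    public static void main(String[] args) {
        PageCounters textPage = new PageCounters();
        textPage.totalCharsPerPage = 500; // a "normal" page with plenty of text
        System.out.println("reset first:      OCR = " + endPageResetFirst(textPage));
        // prints "reset first:      OCR = true"

        textPage.totalCharsPerPage = 500;
        System.out.println("reset in finally: OCR = " + endPageResetInFinally(textPage));
        // prints "reset in finally: OCR = false"
    }
}
```

With the reset in the finally block, the return value is computed from the 
counts gathered for the page, and the counters are still cleared afterwards for 
the next page.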

  was:
For performance reasons, I would like to activate the OCR scanning only when 
necessary. I therefore tried to set the OCR strategy to "AUTO".

However I see that also for "normal" PDF files (where no OCR should be 
required), OCR is performed and this not also slows down the application but 
(more important) results in doubling the resulting text.

Trying to understand how this works, I think I may have found a possible error 
in the class *AbstractPDF2XHTML*. There, in case of selected OCR Strategy AUTO, 
on line 404 is checked the total number of characters found on the page, if 
this is lower than 10 OCR is performed:

 
{code:java}
} else if (config.getOcrStrategy().equals(PDFParserConfig.OCR_STRATEGY.AUTO)) {
    //TODO add more sophistication
    if (totalCharsPerPage < 10 || unmappedUnicodeCharsPerPage > 10) {
        doOCROnCurrentPage();
    }
}
{code}
 

The logic is correct, but unfortunately at the beginning of the method (line 
361 and 362) the two variables checked on this line are reset to 0, so this 
conditions will always be true.

I would suggest to move the reset of the two variables inside a finally block 
at the end of the method.


> Possible bug with OCR strategy AUTO
> -----------------------------------
>
>                 Key: TIKA-3002
>                 URL: https://issues.apache.org/jira/browse/TIKA-3002
>             Project: Tika
>          Issue Type: Bug
>          Components: ocr, parser
>    Affects Versions: 1.22
>            Reporter: Patrick Herber
>            Priority: Major
>
> For performance reasons, I would like to activate the OCR scanning only when 
> necessary. I therefore tried to set the OCR strategy to "AUTO".
> However, I see that OCR is also performed for "normal" PDF files (where no 
> OCR should be required). This not only slows down the application but, more 
> importantly, doubles the resulting text.
> Trying to understand how this works, I think I may have found a possible 
> error in the class *AbstractPDF2XHTML*. There, when the OCR strategy AUTO is 
> selected, line 404 checks the total number of characters found on the page: 
> if this is lower than 10, OCR is performed.
> {code:java}
> } else if (config.getOcrStrategy().equals(PDFParserConfig.OCR_STRATEGY.AUTO)) {
>     //TODO add more sophistication
>     if (totalCharsPerPage < 10 || unmappedUnicodeCharsPerPage > 10) {
>         doOCROnCurrentPage();
>     }
> }
> {code}
> The logic is correct, but unfortunately, at the beginning of the method 
> (lines 361 and 362), the two variables checked here are reset to 0, so this 
> condition is always true.
> I would suggest moving the reset of the two variables into a finally block 
> at the end of the method.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
