[ 
https://issues.apache.org/jira/browse/TIKA-3361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17341027#comment-17341027
 ] 

Peter Kronenberg commented on TIKA-3361:
----------------------------------------

So I'd like to try to restart this conversation. Here is another possibility 
that avoids use of the words Best or Fast.

We can just have a range of values and call them Level1, Level2, etc. In all 
cases, the limit for totalCharsPerPage remains at 10. I still think this is a 
good heuristic for whether or not the page consists of characters or a scanned 
image.


 The value for UnmappedUnicodeCharactersPerPage would change as follows:
 * Level1 = 0 – In other words, if there is a single Unmapped Character, then we 
will use OCR. If there are no Unmapped Characters, we would still look at 
totalCharsPerPage. This would be equivalent to BEST (but we won't necessarily 
use that keyword)
 * Level2 = 20 – this is the current behavior and would be the default
 * Level3 = 10% – this is my own personal opinion on a reasonable value
 * Level4 = -1 (i.e. Infinity) – We would never do OCR based on Unmapped 
Characters (although if totalCharsPerPage < 10, we would still OCR). This would 
be equivalent to FAST
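
To make the proposal concrete, here is a rough sketch of how the four levels could plug into the existing AUTO check. The AutoLevel enum and its shouldOcr method are hypothetical names used only for illustration (nothing like this exists in Tika today), and treating -1 as "never OCR because of unmapped characters" is simply the convention suggested above.

```java
// Hypothetical sketch: AutoLevel and shouldOcr are illustrative, not Tika API.
enum AutoLevel {
    LEVEL1(0, false),   // a single unmapped character triggers OCR
    LEVEL2(20, false),  // proposed default
    LEVEL3(10, true),   // unmapped characters above 10% of the page's characters
    LEVEL4(-1, false);  // -1: never OCR because of unmapped characters

    // The total-characters limit stays at 10 for every level.
    static final int TOTAL_CHARS_THRESHOLD = 10;

    final int unmappedThreshold;
    final boolean percent;

    AutoLevel(int unmappedThreshold, boolean percent) {
        this.unmappedThreshold = unmappedThreshold;
        this.percent = percent;
    }

    boolean shouldOcr(int totalCharsPerPage, int unmappedCharsPerPage) {
        // Too little text: the page is probably a scanned image.
        if (totalCharsPerPage < TOTAL_CHARS_THRESHOLD) {
            return true;
        }
        // Negative threshold disables the unmapped-character test.
        if (unmappedThreshold < 0) {
            return false;
        }
        if (percent) {
            // Compare as integers to avoid floating-point rounding.
            return unmappedCharsPerPage * 100L
                    > (long) unmappedThreshold * totalCharsPerPage;
        }
        return unmappedCharsPerPage > unmappedThreshold;
    }
}
```

With this sketch, a page with 100 characters and 11 unmapped ones would be OCRed under Level1 and Level3, but not under Level2 or Level4.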

I'm open to other names besides Level<n>. This shows 4 pre-defined values; 
we could have others.

 

Of course, the user could still define their own values by specifying *m, n* 
instead of using one of the pre-defined keywords.
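
For illustration, parsing such a setting might look roughly like the following. This is only a sketch under assumptions: the AutoThresholds class and its parse method are made-up names, the Level<n> keywords are the ones proposed in this comment, and custom values follow the m[%], n form from the issue description, with n defaulting to 10.

```java
import java.util.Locale;

// Hypothetical sketch: AutoThresholds is illustrative, not Tika API.
final class AutoThresholds {
    final int unmapped;     // unmapped-character threshold; -1 means never
    final boolean percent;  // true if 'unmapped' is a percentage of the page's characters
    final int totalChars;   // pages with fewer characters than this are OCRed

    AutoThresholds(int unmapped, boolean percent, int totalChars) {
        this.unmapped = unmapped;
        this.percent = percent;
        this.totalChars = totalChars;
    }

    // Accepts a proposed keyword (Level1..Level4) or custom values "m[%], n".
    static AutoThresholds parse(String value) {
        String v = value.trim().toLowerCase(Locale.ROOT);
        switch (v) {
            case "level1": return new AutoThresholds(0, false, 10);
            case "level2": return new AutoThresholds(20, false, 10);
            case "level3": return new AutoThresholds(10, true, 10);
            case "level4": return new AutoThresholds(-1, false, 10);
            default: break;
        }
        String[] parts = v.split(",");
        if (parts.length > 2) {
            throw new IllegalArgumentException("Expected m[%], n but got: " + value);
        }
        String m = parts[0].trim();
        boolean pct = m.endsWith("%");
        if (pct) {
            m = m.substring(0, m.length() - 1).trim();
        }
        // NumberFormatException (an IllegalArgumentException) signals bad input.
        int unmappedValue = Integer.parseInt(m);
        int total = parts.length == 2 ? Integer.parseInt(parts[1].trim()) : 10;
        return new AutoThresholds(unmappedValue, pct, total);
    }
}
```

Under these assumptions, parse("Level3") and parse("10%, 10") would yield the same thresholds, and parse("20") would fall back to the default of 10 for the total-characters limit.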

>  Improve intelligence of OCRStrategy=AUTO
> -----------------------------------------
>
>                 Key: TIKA-3361
>                 URL: https://issues.apache.org/jira/browse/TIKA-3361
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Peter Kronenberg
>            Priority: Major
>
> Didn’t get a whole lot of feedback on the mailing list, so here’s my attempt 
> at improving OCRStrategy=Auto
> Currently, this strategy performs the following test
> {code:java}
> if (totalCharsPerPage < 10 || unmappedUnicodeCharsPerPage > 10) {
>                     doOCROnCurrentPage(AUTO);
>                 }
> {code}
> I added a way to change the two numbers involved: the threshold for the total 
> characters per page (below which, we OCR the page), and the threshold for 
> unmapped characters (above which we OCR the page)
> My main concern is with the unmapped characters. OCR adds a lot of overhead, 
> which might not be necessary for simply a few unmapped characters
> I added a new config, *OCRStrategyAuto*, which is only used if 
> OCRStrategy=AUTO. Its format is
> {code:java}
> ocrStrategyAuto = best|fast|m[%], n
> {code}
> ‘best’ and ‘fast’ are shortcuts. More later
> m, n – m is the threshold for the number of unmapped characters per page. It 
> can also be specified as a percentage. So, m=20 means if your page has more 
> than 20 unmapped characters, it will OCR. m=20% means if the unmapped 
> characters are more than 20% of the total characters, then it will OCR.
> n is the threshold for the total number of characters on the page. n does not 
> need to be specified and defaults to 10
> {code:java}
> <param name="ocrStrategyAuto" type="string">20</param>
> {code}
> is equivalent to
> {code:java}
> <param name="ocrStrategyAuto" type="string">20, 10</param>
> {code}
> *best* is shorthand for *20,10*
> {code:java}
> <param name="ocrStrategyAuto" type="string">best</param>
> {code}
> is equivalent to
> {code:java}
> <param name="ocrStrategyAuto" type="string">20, 10</param>
> {code}
> *best* is the default and is equivalent to the current behavior
>  *fast* is a shortcut for *10%, 10*, which will avoid OCR unless the number 
> of unmapped characters is greater than 10%
> {code:java}
> <param name="ocrStrategyAuto" type="string">fast</param>
> {code}
> is equivalent to
> {code:java}
> <param name="ocrStrategyAuto" type="string">10%, 10</param>
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
