[ 
https://issues.apache.org/jira/browse/TIKA-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15313972#comment-15313972
 ] 

Tim Allison commented on TIKA-1994:
-----------------------------------

Sounds like a great strategy. That will catch image-only (or image-mostly) 
pages.  Let's open a separate issue to track other strategies beyond what I 
initially put in.

bq. because it suggests a high chance that the page is formed by a big 
(scanned) image 

Note that we process the page and then run OCR (if the strategy is ocr+text).  
We could gather info about the size/number of the images before making the 
determination.
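
A rough sketch of what gathering that info might look like: record the size of each image as the page is processed, then check whether one image covers most of the page before deciding to OCR. The class name, the method names, and the 0.8 coverage threshold are all hypothetical; none of this is existing Tika or PDFBox API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: collects image sizes seen on one page, then offers a guess on OCR. */
final class PageImageStats {
    private final double pageArea;
    private final List<double[]> images = new ArrayList<>();

    PageImageStats(double pageWidth, double pageHeight) {
        this.pageArea = pageWidth * pageHeight;
    }

    /** Record one image, with width and height in the same units as the page. */
    void addImage(double width, double height) {
        images.add(new double[] {width, height});
    }

    int imageCount() {
        return images.size();
    }

    /** Fraction of the page area covered by the largest single image, capped at 1.0. */
    double largestImageCoverage() {
        double max = 0;
        for (double[] wh : images) {
            max = Math.max(max, wh[0] * wh[1]);
        }
        return pageArea <= 0 ? 0 : Math.min(1.0, max / pageArea);
    }

    /** True if one image covers at least minCoverage of the page (assumed threshold). */
    boolean looksScanned(double minCoverage) {
        return largestImageCoverage() >= minCoverage;
    }
}
```

With something like this, the page could be processed as usual and OCR run only when `looksScanned(0.8)` returns true; the right threshold would need testing against real documents.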

bq. speeds up the extraction a lot.

Yes, I have to admit, I've been _really_ impressed by the quality of Tesseract 
(on English, at least), but the speed is an area of concern.

I'm hoping to run "ocr_only" against some of our corpus over the weekend and 
compare that with "no_ocr."  In addition to 'run ocr if there's only a little 
text', it would be neat to be able to run ocr if there is 'bad text' 
(TIKA-1443).
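
The 'only a little text' rule could be as simple as the sketch below, which counts non-whitespace characters in the text extracted from a page and triggers OCR when the count falls under a cutoff. The class name, method names, and any cutoff value are hypothetical, and this does not attempt the 'bad text' check.

```java
/** Hypothetical rule: run OCR when a page yields very little extracted text. */
final class OcrTrigger {
    private OcrTrigger() {
    }

    /** Count the non-whitespace code points in the extracted text. */
    static long countNonWhitespace(String text) {
        return text.codePoints()
                .filter(cp -> !Character.isWhitespace(cp))
                .count();
    }

    /** True if the page text is missing or shorter than minChars (assumed cutoff). */
    static boolean shouldRunOcr(String extractedText, int minChars) {
        return extractedText == null || countNonWhitespace(extractedText) < minChars;
    }
}
```

A sensible cutoff would probably have to come out of the corpus comparison rather than be guessed up front.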

Have you run any experiments on how the dpi setting, image format, and image 
type affect OCR performance?  Does 200 dpi PNG GRAY do better than 200 dpi 
JPEG RGB, for example?

> Integrate OCR with PDFParser
> ----------------------------
>
>                 Key: TIKA-1994
>                 URL: https://issues.apache.org/jira/browse/TIKA-1994
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>
> Users can now run OCR on individual images embedded inline in PDFs if they 
> get the configuration right.  
> There are some drawbacks: 1) the text appears as an attachment if using the 
> RecursiveParserWrapper, 2) text may be more cleanly extracted on the fully 
> rendered page instead of on the individual images (this is still tbd).
> It might be useful to run OCR against each rendered page (instead of the 
> component images). 
> Integrating OCR is on the roadmap for PDFBox 2.1 (PDFBOX-1912).  This will 
> allow us to experiment with strategies until the cleaner integration is 
> available with PDFBox 2.1.



