[ 
https://issues.apache.org/jira/browse/TIKA-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15316449#comment-15316449
 ] 

Luis Filipe Nassif commented on TIKA-1994:
------------------------------------------

Using information such as the number and size of images per page before making 
the decision would be great.
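
A rough sketch of what such a pre-check might look like, in plain Java. 
Everything here is hypothetical: the class, the strategy names and both 
thresholds are assumptions for illustration, not existing Tika or PDFBox code, 
and the cut-off values would need tuning on a real corpus.

```java
import java.util.List;

/** Hypothetical per-page pre-check; not part of Tika or PDFBox. */
final class OcrStrategySketch {

    enum Strategy { NO_OCR, OCR_INLINE_IMAGES, OCR_RENDERED_PAGE }

    // Assumed thresholds, chosen only for illustration.
    static final double MIN_COVERAGE = 0.05;   // ignore pages with only icon-sized images
    static final int MAX_SEPARATE_IMAGES = 3;  // more than this: treat as a fragmented scan

    /**
     * @param pageWidthPt  page width in points
     * @param pageHeightPt page height in points
     * @param imageSizesPt one {width, height} pair, in points, per image drawn on the page
     */
    static Strategy choose(double pageWidthPt, double pageHeightPt,
                           List<double[]> imageSizesPt) {
        double imageArea = 0;
        for (double[] size : imageSizesPt) {
            imageArea += size[0] * size[1];
        }
        // Fraction of the page covered by images, capped at 1 for overlapping images.
        double coverage = Math.min(1.0, imageArea / (pageWidthPt * pageHeightPt));

        if (imageSizesPt.isEmpty() || coverage < MIN_COVERAGE) {
            return Strategy.NO_OCR;
        }
        if (imageSizesPt.size() > MAX_SEPARATE_IMAGES) {
            // Many fragments (strips, tiles): OCR the rendered page as a whole.
            return Strategy.OCR_RENDERED_PAGE;
        }
        return Strategy.OCR_INLINE_IMAGES;
    }

    public static void main(String[] args) {
        // A US Letter page (612 x 792 pt) holding one full-page scan.
        List<double[]> images = java.util.Collections.singletonList(new double[] {612, 792});
        System.out.println(choose(612, 792, images));
    }
}
```

The image sizes would have to be collected while the page's content stream is 
processed; how that is done is left out of the sketch.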

Yes, I ran some experiments a few years ago on these settings (150 vs. 200 vs. 
300 dpi; b&w vs. grayscale vs. RGB). Tesseract suggests 300 dpi for 10-point 
fonts, but at that time I got very good results and speed with 200 dpi 
grayscale on my very limited corpus (Portuguese language, font sizes larger 
than 10 pt). PNG is a better format than JPEG: it is lossless, has less noise, 
and is recommended by Tesseract too.
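
To make the settings concrete, here is a minimal sketch using only the JDK. It 
shows the pixel size of a page at a given dpi (PDF user space has 72 points per 
inch), a grayscale conversion, and PNG encoding. The class and method names are 
made up for illustration, and the blank image stands in for a rendered page; in 
PDFBox 2.x the rendering itself would presumably go through 
PDFRenderer.renderImageWithDPI with ImageType.GRAY.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.ImageIO;

/** Hypothetical helper illustrating the 200 dpi / grayscale / PNG settings. */
final class OcrRenderSketch {

    static final float POINTS_PER_INCH = 72f;

    /** Converts a length in PDF points to pixels at the given resolution. */
    static int pixelsFor(float lengthPt, int dpi) {
        return Math.round(lengthPt * dpi / POINTS_PER_INCH);
    }

    /** Copies any image into an 8-bit grayscale image of the same size. */
    static BufferedImage toGray(BufferedImage src) {
        BufferedImage gray = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = gray.createGraphics();
        try {
            g.drawImage(src, 0, 0, null);
        } finally {
            g.dispose();
        }
        return gray;
    }

    /** Encodes the image as PNG (lossless) and returns the bytes. */
    static byte[] toPng(BufferedImage image) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            if (!ImageIO.write(image, "png", out)) {
                throw new IllegalStateException("No PNG writer available");
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // US Letter is 612 x 792 pt, i.e. 1700 x 2200 px at 200 dpi.
        int width = pixelsFor(612f, 200);
        int height = pixelsFor(792f, 200);
        BufferedImage page = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        byte[] png = toPng(toGray(page));
        System.out.println(width + " x " + height + " px, " + png.length + " PNG bytes");
    }
}
```

At 300 dpi the same page is 2550 x 3300 px, 2.25 times as many pixels as at 
200 dpi, and 8-bit grayscale needs one sample per pixel where RGB needs three, 
which is consistent with the speed difference mentioned above.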

> Integrate OCR with PDFParser
> ----------------------------
>
>                 Key: TIKA-1994
>                 URL: https://issues.apache.org/jira/browse/TIKA-1994
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>             Fix For: 2.0, 1.14
>
>
> Users can now run OCR on individual images embedded inline in PDFs if they 
> get the configuration right.  
> There are some drawbacks: 1) the text appears as an attachment if using the 
> RecursiveParserWrapper, 2) text may be more cleanly extracted on the fully 
> rendered page instead of on the individual images (this is still tbd).
> It might be useful to run OCR against each rendered page (instead of the 
> component images). 
> Integrating OCR is on the roadmap for PDFBox 2.1 (PDFBOX-1912).  This will 
> allow us to experiment with strategies until the cleaner integration is 
> available with PDFBox 2.1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
