[ 
https://issues.apache.org/jira/browse/TIKA-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15313422#comment-15313422
 ] 

Luis Filipe Nassif edited comment on TIKA-1994 at 6/3/16 1:09 AM:
------------------------------------------------------------------

Hi Tim,

Until the deeper PDFBox integration arrives (good to know they are working on 
that!), I think this strategy is very good. We currently use it in my 
organization instead of OCRing individual images inside a PDF. As you know, a 
PDF may hold one image per paragraph, line, word or even character, and OCRing 
those images one at a time can give poor results.

As a suggestion, we count the number of text chars extracted per page and only 
run OCR if the count is below a configurable threshold (we use 100 by default), 
because a low count suggests a high chance that the page consists of one big 
(scanned) image. That eliminates a lot of duplicate text that OCR would 
otherwise return and speeds up the extraction a lot. 
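
To make the heuristic concrete, here is a minimal sketch in plain Java. It is 
not our actual code and not Tika or PDFBox API: the class and method names are 
hypothetical, and it assumes the per-page text has already been extracted into 
strings. Ignoring whitespace when counting is also an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only (not part of Tika or PDFBox).
final class OcrPageFilter {

    // Default threshold mentioned in the comment above.
    static final int DEFAULT_MIN_CHARS = 100;

    // Counts the non-whitespace chars in the text extracted from one page.
    // Skipping whitespace is an assumption made for this sketch.
    static int countTextChars(String pageText) {
        if (pageText == null) {
            return 0;
        }
        int count = 0;
        for (int i = 0; i < pageText.length(); i++) {
            if (!Character.isWhitespace(pageText.charAt(i))) {
                count++;
            }
        }
        return count;
    }

    // A page is sent to OCR only if it has fewer text chars than minChars.
    static boolean shouldOcr(String pageText, int minChars) {
        return countTextChars(pageText) < minChars;
    }

    // Returns the zero-based indexes of the pages that should be OCRed.
    static List<Integer> pagesToOcr(List<String> pageTexts, int minChars) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < pageTexts.size(); i++) {
            if (shouldOcr(pageTexts.get(i), minChars)) {
                result.add(i);
            }
        }
        return result;
    }
}
```

With something like this, only the pages returned by pagesToOcr(pageTexts, 
OcrPageFilter.DEFAULT_MIN_CHARS) would be rendered and passed to the OCR 
engine. The text already extracted from the other pages is kept as is.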



> Integrate OCR with PDFParser
> ----------------------------
>
>                 Key: TIKA-1994
>                 URL: https://issues.apache.org/jira/browse/TIKA-1994
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>
> Users can now run OCR on individual images embedded inline in PDFs if they 
> get the configuration right.  
> There are some drawbacks: 1) the text appears as an attachment if using the 
> RecursiveParserWrapper, 2) text may be more cleanly extracted on the fully 
> rendered page instead of on the individual images (this is still tbd).
> It might be useful to run OCR against each rendered page (instead of the 
> component images). 
> Integrating OCR is on the roadmap for PDFBox 2.1 (PDFBOX-1912).  This will 
> allow us to experiment with strategies until the cleaner integration is 
> available with PDFBox 2.1.


