[ 
https://issues.apache.org/jira/browse/TIKA-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1994:
------------------------------
    Description: 
Users can now run OCR on individual images embedded inline in PDFs if they get 
the configuration right.  

There are some drawbacks: 1) the text appears as an attachment if using the 
RecursiveParserWrapper, 2) text may be more cleanly extracted from the fully 
rendered page than from the individual images (this is still TBD).

It might be useful to run OCR against each rendered page (instead of the 
component images). 

Integrating OCR is on the roadmap for PDFBox 2.1 (PDFBOX-1912).  This issue 
will let us experiment with OCR strategies until the cleaner integration is 
available in PDFBox 2.1.

  was:
Users can now run OCR on individual images embedded inline with PDFs if they do 
the right configuration.  

It might be useful to run OCR against each rendered page (instead of the 
component images). 

Integrating OCR is on the roadmap for PDFBox 2.1 (PDFBOX-1912).  This will 
allow us to experiment with strategies until the cleaner integration is 
available with PDFBox 2.1.


> Integrate OCR with PDFParser
> ----------------------------
>
>                 Key: TIKA-1994
>                 URL: https://issues.apache.org/jira/browse/TIKA-1994
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>
> Users can now run OCR on individual images embedded inline in PDFs if they 
> get the configuration right.  
> There are some drawbacks: 1) the text appears as an attachment if using the 
> RecursiveParserWrapper, 2) text may be more cleanly extracted from the fully 
> rendered page than from the individual images (this is still TBD).
> It might be useful to run OCR against each rendered page (instead of the 
> component images). 
> Integrating OCR is on the roadmap for PDFBox 2.1 (PDFBOX-1912).  This issue 
> will let us experiment with OCR strategies until the cleaner integration is 
> available in PDFBox 2.1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
