Tim Allison created TIKA-3270:
---------------------------------

             Summary: Render non-text in PDFs for OCR
                 Key: TIKA-3270
                 URL: https://issues.apache.org/jira/browse/TIKA-3270
             Project: Tika
          Issue Type: Improvement
            Reporter: Tim Allison


When we render a PDF page for OCR, we rely on PDFBox to render the entire 
contents of the page, including text that may already be available via regular 
extraction methods.

As a result, if a user selects ocr_and_text, the output can contain duplicate 
text: the text as stored in the PDF and the same text regenerated via OCR.  In 
the xhtml output, we do mark OCR content with a separate "div" so that users 
can tell the two apart, but it would still be useful to avoid running OCR on 
text that was already reliably extracted.

One solution to this was proposed by [~lfcnassif] on TIKA-3258, with a 
technical/implementation recommendation by [~tilman] to subclass PDFRenderer 
and PageDrawer to render only the image components of a page.
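
To make the shape of that recommendation concrete, here is a minimal sketch of 
the pattern in plain Java.  Every type in it (Op, Drawer, NoTextDrawer, 
Renderer, NoTextRenderer) is a hypothetical stand-in invented for illustration; 
none of them are PDFBox classes, and the real change would subclass 
PDFRenderer and PageDrawer instead.  The idea is that the renderer creates its 
drawer through an overridable factory method, and a drawer subclass turns the 
text-drawing step into a no-op so that only image content reaches the canvas.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins only: these are NOT PDFBox classes or signatures.
class ImageOnlyRenderSketch {

    /** One drawing operation on a page: either a run of text or an image. */
    static final class Op {
        final boolean isText;
        final String payload;

        Op(boolean isText, String payload) {
            this.isText = isText;
            this.payload = payload;
        }
    }

    /** Stand-in for a page drawer: records everything it is asked to paint. */
    static class Drawer {
        final List<String> canvas = new ArrayList<>();

        void showText(String text) {
            canvas.add("text:" + text);
        }

        void drawImage(String name) {
            canvas.add("image:" + name);
        }
    }

    /** Drawer that drops text, so only image content reaches the canvas. */
    static class NoTextDrawer extends Drawer {
        @Override
        void showText(String text) {
            // Intentionally empty: text is left to regular extraction.
        }
    }

    /** Stand-in for a renderer that builds its drawer via a factory method. */
    static class Renderer {
        Drawer createDrawer() {
            return new Drawer();
        }

        List<String> render(List<Op> page) {
            Drawer drawer = createDrawer();
            for (Op op : page) {
                if (op.isText) {
                    drawer.showText(op.payload);
                } else {
                    drawer.drawImage(op.payload);
                }
            }
            return drawer.canvas;
        }
    }

    /** Renderer that swaps in the text-skipping drawer. */
    static class NoTextRenderer extends Renderer {
        @Override
        Drawer createDrawer() {
            return new NoTextDrawer();
        }
    }

    /** A made-up page holding one text run and one embedded image. */
    static List<Op> samplePage() {
        return Arrays.asList(new Op(true, "Invoice 42"), new Op(false, "scan.png"));
    }

    public static void main(String[] args) {
        System.out.println(new Renderer().render(samplePage()));
        System.out.println(new NoTextRenderer().render(samplePage()));
    }
}
```

With a split like this, the rest of the OCR path would not need to change: it 
would receive a rendered page that no longer contains the text glyphs.  
Which PDFBox methods have to be overridden to get that effect is left open 
here and would need to be checked against the PDFBox version in use.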

This would be a new, non-breaking feature.  This is not a blocker on 2.0.0.



