Tim Allison created TIKA-3270:
---------------------------------
Summary: Render non-text in PDFs for OCR
Key: TIKA-3270
URL: https://issues.apache.org/jira/browse/TIKA-3270
Project: Tika
Issue Type: Improvement
Reporter: Tim Allison
When we render a PDF page for OCR, we rely on PDFBox to render the entire
contents of the page, including text that is already available via regular
extraction methods.
As a result, if a user selects ocr_and_text, the output can contain duplicate
text: the text as stored in the PDF and the same text regenerated via OCR. In
the xhtml output, we do mark the OCR text in a separate "div" so that users
can distinguish the two, but it would still be useful not to have to run OCR
on text that was already reliably extracted.
One solution to this was proposed by [~lfcnassif] on TIKA-3258, with a
technical/implementation recommendation by [~tilman] to subclass PDFRenderer
and PageDrawer to render only the image components of a page.
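To make the shape of that recommendation concrete, here is a minimal,
self-contained sketch of the subclass-and-override pattern. It does not use
PDFBox: SketchRenderer, SketchPageDrawer, PageItem and showGlyphs are
hypothetical stand-ins, and the page model is a deliberate simplification.
In PDFBox itself the corresponding hooks would be the drawer factory on
PDFRenderer and the glyph-drawing method on PageDrawer; their exact
signatures differ between PDFBox versions and should be checked against the
version in use.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ImageOnlyRenderSketch {

    /** One item of page content; a hypothetical simplification of a PDF content stream. */
    static final class PageItem {
        final boolean isText;
        final String payload;

        PageItem(boolean isText, String payload) {
            this.isText = isText;
            this.payload = payload;
        }
    }

    /** Hypothetical stand-in for PageDrawer: sends each item to an overridable hook. */
    static class SketchPageDrawer {
        protected final List<String> painted = new ArrayList<>();

        protected void showGlyphs(String text) {
            painted.add("text:" + text);
        }

        protected void drawImage(String name) {
            painted.add("image:" + name);
        }

        final List<String> drawPage(List<PageItem> page) {
            for (PageItem item : page) {
                if (item.isText) {
                    showGlyphs(item.payload);
                } else {
                    drawImage(item.payload);
                }
            }
            return painted;
        }
    }

    /** Drawer that suppresses text, so only image content reaches the OCR input. */
    static class ImageOnlyPageDrawer extends SketchPageDrawer {
        @Override
        protected void showGlyphs(String text) {
            // Intentionally empty: this text is assumed to be extracted elsewhere.
        }
    }

    /** Hypothetical stand-in for PDFRenderer: a factory method selects the drawer. */
    static class SketchRenderer {
        protected SketchPageDrawer createPageDrawer() {
            return new SketchPageDrawer();
        }

        final List<String> render(List<PageItem> page) {
            return createPageDrawer().drawPage(page);
        }
    }

    /** Renderer subclass whose only change is to plug in the image-only drawer. */
    static class ImageOnlyRenderer extends SketchRenderer {
        @Override
        protected SketchPageDrawer createPageDrawer() {
            return new ImageOnlyPageDrawer();
        }
    }

    /** A made-up page with one run of stored text and one embedded image. */
    static List<PageItem> samplePage() {
        return Arrays.asList(
                new PageItem(true, "Invoice 42"),
                new PageItem(false, "scanned-stamp"));
    }

    public static void main(String[] args) {
        List<PageItem> page = samplePage();
        // prints [text:Invoice 42, image:scanned-stamp]
        System.out.println(new SketchRenderer().render(page));
        // prints [image:scanned-stamp]
        System.out.println(new ImageOnlyRenderer().render(page));
    }
}
```

The point of the design is that the default rendering path stays untouched:
the image-only behavior lives entirely in two small subclasses, which is
consistent with this being a new, non-breaking feature.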
This would be a new, non-breaking feature. This is not a blocker on 2.0.0.