Randal Moss created TIKA-1790:
---------------------------------
Summary: Enhancement for extracting text from pdfs
Key: TIKA-1790
URL: https://issues.apache.org/jira/browse/TIKA-1790
Project: Tika
Issue Type: Improvement
Components: example, parser
Reporter: Randal Moss
Priority: Minor
This enhancement would attempt to extract more text from PDF images that have
multicolored backgrounds by applying adaptive threshold binarization before
running Tesseract for OCR. It would also try to extract text from vector images
inside PDFs by first rasterizing them with Ghostscript and then applying
Tesseract to the flattened images. The final output would be a text file
containing all of the extracted text.
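
To make the two steps above concrete, here is a minimal sketch in Java. It is not taken from the linked repository: the class name PdfOcrSketch, its method names, the neighbourhood radius, and the offset are all hypothetical. It assumes a plain mean adaptive threshold over a square window and the standard gs and tesseract command-line tools, and it only builds the command lines rather than running them.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch; names and parameters are assumptions for illustration. */
public class PdfOcrSketch {

    /**
     * Mean adaptive threshold. A pixel is marked as text (true) when it is
     * darker than the mean of its (2*radius+1) x (2*radius+1) neighbourhood
     * minus offset. gray holds 0..255 intensities indexed as gray[y][x].
     */
    public static boolean[][] binarize(int[][] gray, int radius, int offset) {
        int h = gray.length, w = gray[0].length;
        // Integral image: sum[y][x] is the sum of gray[0..y-1][0..x-1].
        long[][] sum = new long[h + 1][w + 1];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                sum[y + 1][x + 1] = gray[y][x] + sum[y][x + 1] + sum[y + 1][x] - sum[y][x];
            }
        }
        boolean[][] dark = new boolean[h][w];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                // Clamp the window to the image borders.
                int y0 = Math.max(0, y - radius), y1 = Math.min(h, y + radius + 1);
                int x0 = Math.max(0, x - radius), x1 = Math.min(w, x + radius + 1);
                long total = sum[y1][x1] - sum[y0][x1] - sum[y1][x0] + sum[y0][x0];
                long count = (long) (y1 - y0) * (x1 - x0);
                // pixel < mean - offset, multiplied through by count to avoid division.
                dark[y][x] = (long) gray[y][x] * count < total - (long) offset * count;
            }
        }
        return dark;
    }

    /** Ghostscript command that rasterizes each PDF page to a 24-bit PNG. */
    public static List<String> ghostscriptCommand(String pdf, String outPattern, int dpi) {
        return Arrays.asList("gs", "-dNOPAUSE", "-dBATCH", "-sDEVICE=png16m",
                "-r" + dpi, "-sOutputFile=" + outPattern, pdf);
    }

    /** Tesseract command that writes the recognized text to outBase + ".txt". */
    public static List<String> tesseractCommand(String image, String outBase) {
        return Arrays.asList("tesseract", image, outBase);
    }
}
```

Because the threshold is computed per neighbourhood, a faint mark on a bright background and a dark mark on a dark background can both be picked out, which a single global cutoff would not manage. The command lists could be handed to java.lang.ProcessBuilder, with the per-page text files concatenated into the final output.
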
I would like to integrate this as a library separate from Tika, similar to how
the [GeoTopicParser|https://wiki.apache.org/tika/GeoTopicParser] is handled.
The code is still a work in progress and can be found
[here|https://github.com/RandalMoss/pdf-search].
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)