Tim Allison created TIKA-3258:
---------------------------------

             Summary: Run OCR on PDFs with 'auto' mode as default in Tika 2.0.0
                 Key: TIKA-3258
                 URL: https://issues.apache.org/jira/browse/TIKA-3258
             Project: Tika
          Issue Type: Improvement
            Reporter: Tim Allison


In Tika 1.x, users currently have to configure OCR of PDFs themselves, which 
is a fiddly mess...it doesn't just work out of the box.  We did this initially 
because of concerns (well, the reality) of crazy resource consumption for some 
PDFs, which can have thousands of images per page that are stitched together 
to make a reasonable composite.

Since then, we've made two changes.  First, we've added a second option, which 
renders each page and then runs OCR on that composite image rather than on 
each inline image...so we'll only call tesseract once per page.  Second, 
we've added an 'auto' mode that runs OCR only on pages that didn't have much 
text extracted.  While there is plenty of room for improvement in the 'auto' 
heuristic, I think we should move to running OCR automatically on PDFs by 
default in 2.0.0.

Users will now have to disable OCR if they don't want it.
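
To make the per-page decision concrete, here is a minimal sketch of how an 
'auto'-style check with an opt-out could look.  It is not Tika's 
implementation: the class, the enum, and the 10-character cutoff are all 
hypothetical names and values chosen for illustration, and the real heuristic 
may weigh other signals.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; none of these names are Tika's API. */
final class AutoOcrSketch {

    /** Illustrative strategies; the names are assumptions, not Tika's enum. */
    enum Strategy { NO_OCR, AUTO }

    /** Assumed cutoff: pages with fewer visible characters count as "not much text". */
    static final int MIN_TEXT_CHARS = 10;

    /** Decides whether a single page should be rendered and sent to OCR. */
    static boolean shouldOcr(Strategy strategy, String extractedText) {
        if (strategy == Strategy.NO_OCR) {
            return false; // the user has opted out of OCR
        }
        // Ignore whitespace so that a page of blank lines still counts as empty.
        int visibleChars = extractedText.replaceAll("\\s+", "").length();
        return visibleChars < MIN_TEXT_CHARS;
    }

    /** Returns the zero-based indices of the pages that would be OCR'd. */
    static List<Integer> pagesToOcr(Strategy strategy, List<String> pageTexts) {
        List<Integer> pages = new ArrayList<>();
        for (int i = 0; i < pageTexts.size(); i++) {
            if (shouldOcr(strategy, pageTexts.get(i))) {
                pages.add(i);
            }
        }
        return pages;
    }
}
```

Under these assumptions, a page whose extracted text is empty or only a page 
number is sent to OCR (one render and one OCR call for that page), a page with 
a full sentence of text is skipped, and NO_OCR skips every page.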

This will be a breaking change, and we'll make sure to document it early and 
often in the "Breaking Changes" sections of the readme.txt.


