[ https://issues.apache.org/jira/browse/TIKA-630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tyler Palsulich resolved TIKA-630.
----------------------------------
Resolution: Fixed
> Dealing with PDF documents from scanning programs
> -------------------------------------------------
>
> Key: TIKA-630
> URL: https://issues.apache.org/jira/browse/TIKA-630
> Project: Tika
> Issue Type: Improvement
> Components: general
> Affects Versions: 0.10
> Reporter: Joseph Vychtrle
> Priority: Minor
> Labels: ocr, pdf
>
> Hey,
> sorry I didn't post this to the mailing list, I never got the confirmation.
> The issue is that people often don't realize there is a difference between
> PDF documents exported from OpenOffice/MS Office and PDFs produced by scanner
> software. When Tika processes a scanned document, it detects the PDF content
> type, but the file contains only images. I don't know how to deal with that.
> There should be a function that determines the type of PDF document so that I
> can hand the PDFs from scanner software over to some OCR software.
> If there is a way to do that, could anybody please explain how?
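
One rough way to make the decision asked for above is to look at how much text comes out of the document: an image-only PDF typically yields little or no text per page. The sketch below is not Tika API. The class name ScannedPdfHeuristic, the method looksScanned and the 20-characters-per-page cutoff are all hypothetical, and it assumes the text has already been extracted (for instance by Tika) and that the page count is known.

```java
// Hypothetical helper, not part of Tika: guesses whether a PDF is image-only
// from the amount of text that could be extracted from it.
final class ScannedPdfHeuristic {

    // Assumed cutoff: documents averaging fewer non-whitespace characters
    // per page than this are treated as scanned. Tune for your own corpus.
    static final int MIN_CHARS_PER_PAGE = 20;

    private ScannedPdfHeuristic() {
    }

    // extractedText: whatever text was extracted from the PDF (may be null or empty)
    // pageCount:     number of pages in the PDF, must be positive
    static boolean looksScanned(String extractedText, int pageCount) {
        if (pageCount <= 0) {
            throw new IllegalArgumentException("pageCount must be positive");
        }
        long visibleChars = 0;
        if (extractedText != null) {
            // Count only non-whitespace characters, since scanned PDFs often
            // produce nothing but blank lines.
            visibleChars = extractedText.chars()
                    .filter(c -> !Character.isWhitespace(c))
                    .count();
        }
        return visibleChars < (long) MIN_CHARS_PER_PAGE * pageCount;
    }
}
```

A caller would route the document to OCR when looksScanned(text, pages) returns true and index the extracted text otherwise. The heuristic can misfire on PDFs that legitimately contain very little text (a cover page, a single diagram), so the cutoff is only a starting point.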
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)