[ https://issues.apache.org/jira/browse/TIKA-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15314585#comment-15314585 ]

Tim Allison commented on TIKA-1994:
-----------------------------------
[~bobpaulin], in 2.0, if we keep the current setup, PDFParser will have to
depend on tika-parser-multimedia-module. That's not too awful, but it's another
intermodule dependency that I'd prefer not to add.

I thought about moving the TesseractOCRParser into its own module, but it
currently depends on the image parsers for metadata (thanks to my complaints
:)). I _think_ that by the time 2.0 is ready, we'll have gotten rid of that
dependency and will let the user choose to combine OCR+image metadata (once we
can combine parsers)...so, down the road, I think it might make sense to break
the OCR parser out into its own module.

Thoughts, obvious solutions?

> Integrate OCR with PDFParser
> ----------------------------
>
> Key: TIKA-1994
> URL: https://issues.apache.org/jira/browse/TIKA-1994
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Assignee: Tim Allison
>
> Users can now run OCR on individual images embedded inline in PDFs if they
> get the configuration right.
> There are some drawbacks: 1) the text appears as an attachment when using the
> RecursiveParserWrapper, and 2) text may be extracted more cleanly from the fully
> rendered page than from the individual images (this is still TBD).
> It might be useful to run OCR against each rendered page (instead of the
> component images).
> Integrating OCR is on the roadmap for PDFBox 2.1 (PDFBOX-1912). Adding
> page-level OCR in Tika now will let us experiment with strategies until the
> cleaner integration is available in PDFBox 2.1.