[ https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13895698#comment-13895698 ]
Chris A. Mattmann commented on TIKA-93:
---------------------------------------
Hey Grant, the patch is looking good! I still need to download it and test it
out, so this is just based on a cursory inspection.
Some comments:
# What is the dependency on jacoco in tika-parent for? That seems orthogonal
to the patch.
# Consider providing the training data directory through the ParseContext
(perhaps a property like o.a.tika.parser.ocr.trainingDataDirPath?).
# The patch depends on a custom external Maven repo (myGrid). Is there any way
to get the jar from the Central repo instead? We have made an effort in Tika to
remove any specific deps on external repositories, see:
http://blog.sonatype.com/2010/03/why-external-repos-are-being-phased-out-of-central/#.UvaEN0JdWxU
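To make the ParseContext suggestion above a bit more concrete, here is a rough
sketch in plain Java. It is not taken from the patch: Context is only a
stand-in, modeled loosely on the class-keyed set/get style of ParseContext, and
OcrConfig, resolveTrainingDir and both paths are hypothetical names made up for
illustration. The idea is that the parser asks the context for an OCR config
and falls back to a default directory when the caller supplied none.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class OcrContextSketch {

    /** Hypothetical config holder; not part of the TIKA-93 patch. */
    static final class OcrConfig {
        private final Path trainingDataDir;

        OcrConfig(Path trainingDataDir) {
            this.trainingDataDir = trainingDataDir;
        }

        Path getTrainingDataDir() {
            return trainingDataDir;
        }
    }

    /** Stand-in for a class-keyed context; an assumption, not Tika's own class. */
    static final class Context {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key, T defaultValue) {
            Object value = entries.get(key);
            return value == null ? defaultValue : key.cast(value);
        }
    }

    /** Returns the caller-supplied training directory, or the fallback if none was set. */
    static Path resolveTrainingDir(Context context, Path fallback) {
        return context.get(OcrConfig.class, new OcrConfig(fallback)).getTrainingDataDir();
    }

    public static void main(String[] args) {
        Context context = new Context();
        Path fallback = Paths.get("tessdata"); // placeholder default

        // Nothing configured yet, so the fallback directory is used.
        System.out.println(resolveTrainingDir(context, fallback));

        // The caller overrides the directory through the context.
        context.set(OcrConfig.class, new OcrConfig(Paths.get("/opt/ocr/training")));
        System.out.println(resolveTrainingDir(context, fallback));
    }
}
```

Keying the lookup on a config class rather than a raw string property keeps the
setting typed, at the cost of one small extra class. A string property would
work too if that fits the patch better.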
Looking great. Maybe we can get some of this into 1.6 even with the deps on the
external repo, but we need to get rid of those before releasing. I will try this
out in a few hours! I'm excited because I may even be able to use this for the
homework assignments in my CS572 class on Search Engines, where we look at FBI
Vault PDF files! :) http://www-scf.usc.edu/~csci572/
> OCR support
> -----------
>
> Key: TIKA-93
> URL: https://issues.apache.org/jira/browse/TIKA-93
> Project: Tika
> Issue Type: New Feature
> Components: parser
> Reporter: Jukka Zitting
> Assignee: Chris A. Mattmann
> Priority: Minor
> Attachments: TIKA-93.patch, TIKA-93.patch
>
>
> I don't know of any decent open source pure Java OCR libraries, but there are
> command line OCR tools like Tesseract
> (http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika to
> extract text content (where available) from image files.
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)