[ 
https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13895698#comment-13895698
 ] 

Chris A. Mattmann commented on TIKA-93:
---------------------------------------

Hey Grant, the patch is looking good! This is based only on a cursory
inspection; I still need to download it and test it out.
Some comments:
# What is the jacoco dependency in tika-parent for? That stuff seems orthogonal
to the patch.
# Maybe think about providing the training directory as part of the
ParseContext (maybe a property like o.a.tika.parser.ocr.trainingDataDirPath?)
# The patch depends on a custom external Maven repo (myGrid). Is there any way
to get the jar from the Central repo instead? We have made an effort in Tika to
remove any specific deps on external repositories, see:
http://blog.sonatype.com/2010/03/why-external-repos-are-being-phased-out-of-central/#.UvaEN0JdWxU
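
To make the ParseContext suggestion above concrete, here is a rough sketch of how a parser could look up the training directory from the context. Everything in it is an assumption for illustration: SketchContext is a small stand-in that mimics ParseContext's class-keyed set/get so the snippet compiles without Tika on the classpath, and OcrSettings and OcrParserSketch are hypothetical names that are not part of the patch.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for ParseContext (assumption): a class-keyed map of helper objects.
class SketchContext {
    private final Map<String, Object> entries = new HashMap<>();

    <T> void set(Class<T> key, T value) {
        entries.put(key.getName(), value);
    }

    <T> T get(Class<T> key, T defaultValue) {
        Object value = entries.get(key.getName());
        return value == null ? defaultValue : key.cast(value);
    }
}

// Hypothetical settings holder carrying the training data directory.
class OcrSettings {
    private final String trainingDataDirPath;

    OcrSettings(String trainingDataDirPath) {
        this.trainingDataDirPath = trainingDataDirPath;
    }

    String getTrainingDataDirPath() {
        return trainingDataDirPath;
    }
}

// Hypothetical parser-side lookup: use the caller's settings if present,
// otherwise fall back to an empty path meaning "not configured".
class OcrParserSketch {
    static String resolveTrainingDir(SketchContext context) {
        OcrSettings fallback = new OcrSettings("");
        return context.get(OcrSettings.class, fallback).getTrainingDataDirPath();
    }
}
```

A caller would put an OcrSettings instance into the context before parsing, and the parser would read it back without needing any global configuration.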

Looking great. Maybe we can get some of this into 1.6 even with the deps on the
external repo, but we need to get rid of those before releasing. I will try this
out in a few hours! I'm excited b/c I may even be able to use this for the 
homework assignments in my CS572 class on Search Engines where we look at FBI 
Vault PDF files! :) http://www-scf.usc.edu/~csci572/


> OCR support
> -----------
>
>                 Key: TIKA-93
>                 URL: https://issues.apache.org/jira/browse/TIKA-93
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Jukka Zitting
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>         Attachments: TIKA-93.patch, TIKA-93.patch
>
>
> I don't know of any decent open source pure Java OCR libraries, but there are 
> command line OCR tools like Tesseract 
> (http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika to 
> extract text content (where available) from image files.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
