Re: [jira] [Commented] (TIKA-93) OCR support

Oleg Tikhonov Sat, 08 Feb 2014 23:15:02 -0800

Hi,
There is another code coverage Maven plugin, called Cobertura.
If you run *mvn clean install cobertura:cobertura*, there is no need to
declare it in the pom.


Hope it helps.




On Sat, Feb 8, 2014 at 10:17 PM, Grant Ingersoll (JIRA) <j...@apache.org> wrote:

>
>     [
> https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13895718#comment-13895718]
>
> Grant Ingersoll commented on TIKA-93:
> -------------------------------------
>
> bq. what is the dependency on jacoco in tika-parent? That stuff seems
> orthogonal to the patch.
>
> I put that in so that I can measure whether I am testing sufficiently.  I
> can separate it out to a different patch.
>
> bq. dependency on custom external Maven repo – myGrid – any way to get the
> jar from the Central repo somewhere? we have made an effort in Tika to
> remove any specific deps on external repositories
>
> We could make that one optional.  All it does is add support for TIFF and
> a few other file formats that aren't part of the standard ImageIO.
>
> bq.  in my CS572 class on Search Engines where we look at FBI Vault PDF
> files!  http://www-scf.usc.edu/~csci572/
>
> I read your abstract for your talk and checked out the Vault and thought
> it would be cool, too.  The main issue is that JavaOCR needs to be trained
> in order to work with that data set.  Tesseract, on the other hand, works
> for it, but alas, needs to be implemented as an OCRParser.  Since Tess4J
> has some bad deps, the only way I could see to do this is to exec the
> process or go write my own JNI integration for Tesseract.  The latter isn't
> likely to happen.  The former feels less than desirable, but would work.
>
> > OCR support
> > -----------
> >
> >                 Key: TIKA-93
> >                 URL: https://issues.apache.org/jira/browse/TIKA-93
> >             Project: Tika
> >          Issue Type: New Feature
> >          Components: parser
> >            Reporter: Jukka Zitting
> >            Assignee: Chris A. Mattmann
> >            Priority: Minor
> >         Attachments: TIKA-93.patch, TIKA-93.patch, TIKA-93.patch
> >
> >
> > I don't know of any decent open source pure Java OCR libraries, but
> > there are command line OCR tools like Tesseract (
> > http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika to
> > extract text content (where available) from image files.
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.1.5#6160)
>
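
For reference, the "exec the process" option mentioned in the quoted comment could look roughly like the sketch below. It is not taken from the TIKA-93 patch: the class name TesseractExec, the methods buildCommand and ocr, and the timeout value are all hypothetical. It assumes a tesseract binary is on the PATH and uses the classic command form, tesseract <image> <outputbase>, which writes the recognized text to <outputbase>.txt.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper for illustration; not part of Tika or the TIKA-93 patch.
class TesseractExec {

    // Classic CLI form: tesseract <image> <outputbase>, which writes <outputbase>.txt.
    static List<String> buildCommand(Path image, Path outputBase) {
        return Arrays.asList("tesseract", image.toString(), outputBase.toString());
    }

    // Runs tesseract as an external process and returns the recognized text.
    // Assumes the tesseract binary is installed and on the PATH.
    static String ocr(Path image, long timeoutSeconds)
            throws IOException, InterruptedException {
        Path outputBase = Files.createTempFile("tesseract-out", "");
        Path textFile = Paths.get(outputBase.toString() + ".txt");
        try {
            // inheritIO() forwards the child's output to this process,
            // so the child cannot block on a full pipe buffer.
            Process process = new ProcessBuilder(buildCommand(image, outputBase))
                    .inheritIO()
                    .start();
            if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                process.destroyForcibly();
                throw new IOException("tesseract timed out after " + timeoutSeconds + "s");
            }
            if (process.exitValue() != 0) {
                throw new IOException("tesseract exited with code " + process.exitValue());
            }
            return new String(Files.readAllBytes(textFile), StandardCharsets.UTF_8);
        } finally {
            // Clean up both the placeholder file and tesseract's output file.
            Files.deleteIfExists(textFile);
            Files.deleteIfExists(outputBase);
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java TesseractExec <image>");
            return;
        }
        System.out.print(ocr(Paths.get(args[0]), 60));
    }
}
```

A real OCRParser would presumably also need to spool the incoming stream to a temporary file and make the binary location configurable; the sketch leaves both out.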
