[ https://issues.apache.org/jira/browse/TIKA-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15939076#comment-15939076 ]
Tim Allison commented on TIKA-2293:
-----------------------------------
[~Thejan], thank you for your work on this parser and for digging into the
license issues.
I regret that I'm against adding Tess4J to Tika even with the Rococoa license
issue resolved.
We currently have a wall that users must climb over -- they have to have
Tesseract installed. That requires a certain amount of technical know-how,
and, from a support perspective, while we can try to help, we can say that it
isn't our responsibility to install Tesseract for them.
With Tess4J, the burden would be on us to hope that the embedded DLLs work for
Windows users, and it would be partly on the user to get the right .so for
themselves if they're on Linux. The clear wall we currently have disappears,
and we would have to help people in a gray area where, when things go
wrong, it is partly our fault and partly not.
We have the same wall now for language packs, and I agree, we would _NOT_ want
to ship all the language packs. However, the user is currently responsible for
getting their own language packs, and we're not in the in-between state we
would be in with Tess4J, where we'd ship the English pack and then
have to support users installing the other language packs.
Also, with Tess4J, we'd be nearly doubling the size of tika-app/tika-server
with _just_ the Windows DLLs, and that doesn't even include the Linux .so file(s).
I far prefer our current setup, with the overhead of the command line and
slightly slower OCR, to the headaches described above that we'd have with
Tess4J.
I would very strongly support adding this as a standalone parser on your own
GitHub site or on another third-party site, and I'd be more than happy to
promote it and point people to it. I'd also want to run an evaluation of
quality and speed comparing this third-party Tesseract integration with ours
so that we can understand the differences.
> Tess4jOCRParser - A simpler Java version of TesseractOCRParser
> ---------------------------------------------------------------
>
> Key: TIKA-2293
> URL: https://issues.apache.org/jira/browse/TIKA-2293
> Project: Tika
> Issue Type: Improvement
> Components: ocr
> Reporter: Thejan Wijesinghe
> Fix For: 1.15
>
>
> Right now, TesseractOCRParser calls tesseract and imagemagick from the command
> line. The intention of this new parser, "Tess4jOCRParser", is to use the Tess4J
> API instead of executing tesseract out of process via runtime.exec.
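
For readers unfamiliar with the out-of-process approach discussed here, the
following is a minimal Java sketch of the general idea, not Tika's actual
TesseractOCRParser code. The class name TesseractCliSketch and both method
names are hypothetical. It assumes a tesseract binary that accepts the usual
"tesseract <image> <outputbase> [-l <lang>]" form and a "--version" flag
(very old builds may only accept "-v").

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not Tika's implementation.
public class TesseractCliSketch {

    // Builds the argument list: tesseract <image> <outputBase> [-l <language>]
    static List<String> buildCommand(String tesseractPath, String imagePath,
                                     String outputBase, String language) {
        List<String> cmd = new ArrayList<>();
        cmd.add(tesseractPath);
        cmd.add(imagePath);
        cmd.add(outputBase);
        if (language != null && !language.isEmpty()) {
            cmd.add("-l");
            cmd.add(language);
        }
        return cmd;
    }

    // Returns true only if the binary can be started and exits with code 0.
    // A missing binary surfaces as an IOException from ProcessBuilder.start().
    static boolean isTesseractAvailable(String tesseractPath) {
        try {
            Process p = new ProcessBuilder(tesseractPath, "--version")
                    .redirectErrorStream(true)
                    .start();
            // Drain the output so the child process cannot block on a full pipe.
            try (InputStream in = p.getInputStream()) {
                while (in.read() != -1) {
                    // discard
                }
            }
            return p.waitFor() == 0;
        } catch (IOException e) {
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        if (!isTesseractAvailable("tesseract")) {
            System.out.println("Tesseract not found; the user must install it.");
            return;
        }
        // File names below are placeholders for illustration.
        System.out.println("Would run: "
                + String.join(" ", buildCommand("tesseract", "page.png", "out", "eng")));
    }
}
```

In this arrangement the native binary and its language packs stay entirely on
the user's side: if the availability check fails, the calling code can simply
skip OCR, which is the "wall" described in the comment above.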