[ 
https://issues.apache.org/jira/browse/TIKA-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15939076#comment-15939076
 ] 

Tim Allison commented on TIKA-2293:
-----------------------------------

[~Thejan], Thank you for your work on this parser and for digging into the 
license issues.

I regret that I'm against adding tess4j into Tika even with the rococoa license 
issue taken away.

We currently have a wall that users must climb over -- they have to have 
Tesseract installed.  That requires a certain amount of technical know-how, 
and, from a support perspective, while we can try to help, we can say that it 
isn't our responsibility to install Tesseract for them.  

With tess4j, the burden would be on us to hope that the embedded dlls work for 
Windows users, and it would largely be on users to get the right .so for 
themselves if they're on Linux.  The clear wall we currently have disappears, 
and we would have to help people in an uncertain area where, when things go 
wrong, it is partly our fault and partly not.

We have the same wall now for language packs, and I agree, we would _NOT_ want 
to ship all the language packs.  Right now, however, users are responsible for 
getting their own language packs, and we're not in the in-between state we 
would be in with tess4j, where we'd give people the English pack and then have 
to support them installing the other language packs.

Also, with tess4j, we'd be nearly doubling the size of tika-app/tika-server 
with _just_ the Windows dlls, and that doesn't even include the Linux .so(s?).

I'd far prefer our current setup, with the overhead of the command line and 
slightly slower OCR'ing, to the headaches described above that we'd have with 
tess4j.

I would very strongly support adding this as a standalone parser on your own 
github site or on another third-party site, and I'd be more than happy to 
promote it and point people to it.  I'd also want to run an evaluation on 
quality/speed between this third party Tesseract integration and ours so that 
we can understand the differences.

>  Tess4jOCRParser - A simpler Java version of TesseractOCRParser
> ---------------------------------------------------------------
>
>                 Key: TIKA-2293
>                 URL: https://issues.apache.org/jira/browse/TIKA-2293
>             Project: Tika
>          Issue Type: Improvement
>          Components: ocr
>            Reporter: Thejan Wijesinghe
>             Fix For: 1.15
>
>
> Right now, TesseractOCRParser calls tesseract and imagemagick from the 
> command line. The intention of this new parser, "Tess4jOCRParser", is to use 
> the Tess4J API instead of the runtime.exec way of executing tesseract out of 
> process.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
