+1

Same experience, of same vintage. :)

-----Original Message-----
From: Luís Filipe Nassif [mailto:lfcnas...@gmail.com] 
Sent: Tuesday, March 7, 2017 10:34 AM
To: dev@tika.apache.org
Subject: Re: Tess4j API for TIKA OCR parser

Hi Thejan,

Before the first version of TesseractOCRParser was committed I tried to use 
Tess4j; that was 4 years ago. Unfortunately, at that time I ran into some 
problems like permanent hangs with tesseract/Tess4j and, even worse, JVM crashes 
because of bugs in the native code (pointers to crazy addresses) when processing 
corrupted images. So I changed strategy and took the Runtime.exec route, 
executing tesseract out of process to get rid of those JVM crashes.
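
To illustrate the out-of-process approach described above, here is a minimal sketch. It is not the actual TesseractOCRParser code; the class and method names (OutOfProcessOcr, buildCommand, runOcr) are hypothetical, and it assumes the standard "tesseract <image> <outputbase>" command-line form and Java 8 or later. A crash in the native code only kills the child process, and a hang is bounded by the timeout.

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, for illustration only (not part of Tika).
class OutOfProcessOcr {

    // Builds the command line: <tesseractPath> <image> <outputBase>.
    // tesseract writes the recognized text to <outputBase>.txt.
    static List<String> buildCommand(String tesseractPath, File image, File outputBase) {
        return Arrays.asList(tesseractPath, image.getPath(), outputBase.getPath());
    }

    // Runs the command in a child process. Returns true only if the process
    // exits with status 0 within the timeout; a hung process is killed.
    static boolean runOcr(List<String> command, long timeoutSeconds)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        // Share the parent's stdout/stderr so the child never blocks on a full pipe.
        builder.inheritIO();
        Process process = builder.start();
        if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            // Timed out: kill the child instead of hanging the JVM thread forever.
            process.destroyForcibly();
            return false;
        }
        return process.exitValue() == 0;
    }
}
```

A caller would pass the result of buildCommand to runOcr and, on success, read the text from the output file. A native crash then shows up as a non-zero exit status rather than taking down the JVM.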

That was a long time ago, and maybe those problems have gone away with current 
tesseract and Tess4j. But for now I recommend committing your changes in a new 
parser instead of changing the default TesseractOCRParser, until the new code 
has been tested against millions of images from the wild with tika-batch and 
proved stable enough to be the default OCR parser of Tika.

Best,
Luis

On Mar 7, 2017 9:58 AM, "Thejan Wijesinghe" < 
thejan.k.wijesin...@gmail.com> wrote:

> Hi Nick,
>
> I thought the same thing. I will try to keep the public method 
> signatures unchanged and will send updates on my progress.
>
> On Tue, Mar 7, 2017 at 5:48 PM, Nick Burch <apa...@gagravarr.org> wrote:
>
> > On Tue, 7 Mar 2017, Thejan Wijesinghe wrote:
> >
> >> I have already used the Tess4j API to rewrite the TesseractOCRParser
> >> class. Although it successfully extracts content from most file
> >> types, it fails some particular unit tests in the
> >> TesseractOCRParserTest class. I can solve that. However, I want to
> >> know whether I can rewrite the entire TesseractOCRParser class from
> >> the ground up. If I do that, there will be many broken links in the
> >> internals of TIKA because, as I witnessed, most of the classes use
> >> the TesseractOCRParser class indirectly.
> >>
> >
> > If you can, try to keep the public methods unchanged. That way, 
> > other callers of the class will be unaffected by your rewrite of 
> > the internal logic.
> >
> > Nick
> >
>
