Hi Thejan,

Before the first version of TesseractOCRParser was committed, I tried to use Tess4j; that was 4 years ago. Unfortunately, at that time I ran into problems such as permanent hangs in tesseract/Tess4j and, even worse, JVM crashes caused by bugs in the native code (pointers to invalid addresses) when processing corrupted images. So I changed strategy and took the Runtime.exec route, executing tesseract out of process, to get rid of those JVM crashes.
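
For illustration, the out-of-process approach described above could be sketched roughly as follows. This is a minimal sketch, not the actual TesseractOCRParser code: the class name OutOfProcessOcr, the helper names, and the use of -1 as a failure marker are all assumptions made for this example. It only relies on the basic command-line form `tesseract <image> <outputbase>` and on the standard ProcessBuilder API.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper (not part of Tika): runs an external OCR binary in a
// separate process, so a native crash or hang cannot take down the calling JVM.
final class OutOfProcessOcr {

    // Assumed marker for "could not start, timed out, or was interrupted".
    static final int FAILED = -1;

    // Builds the basic command line: tesseract <image> <outputbase>.
    static List<String> buildCommand(String tesseractPath, String imagePath, String outputBase) {
        return Arrays.asList(tesseractPath, imagePath, outputBase);
    }

    // Runs the command and returns its exit code, or FAILED if the process
    // could not be started, exceeded the timeout, or the wait was interrupted.
    static int runWithTimeout(List<String> command, long timeoutSeconds) {
        ProcessBuilder builder = new ProcessBuilder(command);
        // Let the child write to the parent's stdout/stderr so it never
        // blocks on a full pipe that nobody is reading.
        builder.inheritIO();
        Process process = null;
        try {
            process = builder.start();
            if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                // A hung process is killed instead of hanging the caller.
                process.destroyForcibly();
                return FAILED;
            }
            return process.exitValue();
        } catch (IOException e) {
            // Typically means the binary was not found or is not executable.
            return FAILED;
        } catch (InterruptedException e) {
            if (process != null) {
                process.destroyForcibly();
            }
            Thread.currentThread().interrupt();
            return FAILED;
        }
    }
}
```

A caller might use `OutOfProcessOcr.runWithTimeout(OutOfProcessOcr.buildCommand("tesseract", "page.png", "page"), 120)` and treat any non-zero result as a failed OCR attempt; the file names and the 120-second timeout are placeholders. The point of the design is isolation: a crash on a corrupted image ends only the child process, and a hang is bounded by the timeout.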

That was a long time ago, and those problems may be gone with current tesseract and Tess4j. But for now I recommend committing your changes in a new parser instead of changing the default TesseractOCRParser, until the new code has been tested with tika-batch against millions of images from the wild and proven stable enough to be Tika's default OCR parser.

Best,
Luis

On Mar 7, 2017 9:58 AM, "Thejan Wijesinghe" <thejan.k.wijesin...@gmail.com> wrote:

> Hi Nick,
>
> I thought the same thing. I will try to keep the public method signatures
> unchanged and will send updates on my progress.
>
> On Tue, Mar 7, 2017 at 5:48 PM, Nick Burch <apa...@gagravarr.org> wrote:
>
> > On Tue, 7 Mar 2017, Thejan Wijesinghe wrote:
> >
> >> I have already used the Tess4j API to rewrite the TesseractOCRParser
> >> class. Although it successfully extracts content from most of the file
> >> types, it fails some particular unit tests in the
> >> TesseractOCRParserTest class. I can solve that. However, I want to know
> >> whether I can rewrite the entire TesseractOCRParser class from the
> >> ground up, but if I do that there will be many broken links in the
> >> internals of TIKA because, as I witnessed, most of the classes use the
> >> TesseractOCRParser class indirectly.
> >
> > If you can, try to keep the public methods unchanged. That way, other
> > callers to the class will be unaffected by your re-write of the internal
> > logic.
> >
> > Nick