Hi everyone!

Luis, it is my pleasure to meet an original creator of a major component of TIKA. I should say that it is a very creative and reliable workaround. :) There are still many areas of the TIKA parsers that are unclear to me. Perhaps you can help me clarify some of them.
Even now, there must be some unreliability in Tess4j, because I also ran into some JVM crashes when I first tested it, although I got past them. I can't say whether this is going to be a sound implementation until I have tested it properly. However, I'll try my best to make this work. I have created a Jira issue for this [1]. I invite you all to help me make this a success.

[1] https://issues.apache.org/jira/browse/TIKA-2293

On Tue, Mar 7, 2017 at 9:42 PM, Thamme Gowda <[email protected]> wrote:

> yes, we can try tika-eval to see the difference. Perfect!
>
> Best,
> TG
>
> On Mar 7, 2017 7:44 AM, "Allison, Timothy B." <[email protected]> wrote:
>
> Y and why not give the new tika-eval module a trial to evaluate the
> differences in output? :)
>
> -----Original Message-----
> From: Thamme Gowda [mailto:[email protected]]
> Sent: Tuesday, March 7, 2017 10:38 AM
> To: Thejan Wijesinghe <[email protected]>
> Cc: [email protected]
> Subject: Re: Tess4j API for TIKA OCR parser
>
> Thanks Nick for the reply.
>
> Thejan,
>
> I am glad to know your progress. Rewriting the TesseractOCRParser would be
> the ultimate goal if using Tess4j proves to be better than the way it is
> done currently.
>
> But, for now, please consider these:
> + Rename your class to *Tess4jOCRParser*. It is a new parser providing the
>   same functionality as *TesseractOCRParser*
> + Keep the *TesseractOCRParser* intact. You can use it as your reference to
>   understand features of OCR parser to support.
> + Benchmark *TesseractOCRParser* and *Tess4jOCRParser* with respect to
>   performance and stability. You can take a set of 100 images and compare
>   how much time each of them took. Please share those results here.
>
> Based on the benchmark, we can decide whether to replace old one with new
> one.
> Because TesseractOCRParser is used along with many other parsers like
> JPEG/PDF etc any improvements you make with Tess4jOCRParser will have a
> huge effect!
>
> P.S.
> + Please don't edit any test cases. You may add new ones, though!
> + Could you please create a Jira Issue to track this. Sorry, I must have
>   said this early.
>
> Best,
> TG
>
> On Tue, Mar 7, 2017 at 4:58 AM, Thejan Wijesinghe <[email protected]> wrote:
>
> > Hi Nick,
> >
> > I thought the same thing. I will try to keep the public method
> > signatures unchanged and will send updates on my progress.
> >
> > On Tue, Mar 7, 2017 at 5:48 PM, Nick Burch <[email protected]> wrote:
> >
> > > On Tue, 7 Mar 2017, Thejan Wijesinghe wrote:
> > >
> > >> I have already use the Tess4j API to rewrite the TesseractOCRParser
> > >> class, Although It successfully extracts content from most of the file
> > >> types, it fails some particular unit tests in the TesseractOCRParserTest
> > >> class. I can solve that. However, I want to know whether I can
> > >> rewrite the entire TesseractOCRParser class from the ground up, but
> > >> if I do that there will be many broken links in the internals of
> > >> TIKA because as I witnessed, most of the classes use
> > >> TesseractOCRParser class indirectly.
> > >
> > > If you can, try to keep the public methods unchanged. That way,
> > > other callers to the class will be unaffected by your re-write of
> > > the internal logic
> > >
> > > Nick
