Re: Tess4j API for TIKA OCR parser

Thejan Wijesinghe Wed, 08 Mar 2017 01:58:47 -0800

Hi everyone!

Luis, it is my pleasure to meet an original creator of a major component of
TIKA. I should say that it is a very creative and reliable workaround. :)
Many areas of the TIKA parsers are still unclear to me. Perhaps you can help
me clarify some of them.


Even now, there must be some unreliability in Tess4j, because I also ran
into some JVM crashes when I first tried to test it, although I got past
them. I can't say whether this is going to be a perfect implementation
until I have tested it properly. However, I'll try my best to make this
work. I have created a Jira issue for this [1]. I invite you all to help me
make this a success.
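
For the comparison TG asks for in the quoted message below (a fixed set of
images, time taken by each parser), a harness along the following lines might
be a starting point. This is only a minimal sketch under assumptions:
OcrBenchmark, OcrEngine and Result are hypothetical names, not part of TIKA or
Tess4j, and the two engines in main() are stubs standing in for adapters
around TesseractOCRParser and the proposed Tess4jOCRParser.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

public class OcrBenchmark {

    /** Hypothetical adapter around whichever OCR parser is being measured. */
    @FunctionalInterface
    public interface OcrEngine {
        String extractText(Path image) throws Exception;
    }

    /** Totals for one engine over the whole image set. */
    public static final class Result {
        public final String name;
        public final int succeeded;
        public final int failed;
        public final long totalNanos;

        Result(String name, int succeeded, int failed, long totalNanos) {
            this.name = name;
            this.succeeded = succeeded;
            this.failed = failed;
            this.totalNanos = totalNanos;
        }

        /** Mean wall-clock time per image in milliseconds (0 for an empty set). */
        public double meanMillis() {
            int n = succeeded + failed;
            return n == 0 ? 0.0 : totalNanos / 1_000_000.0 / n;
        }
    }

    /** Runs one engine over every image, counting failures instead of aborting. */
    public static Result run(String name, OcrEngine engine, List<Path> images) {
        int succeeded = 0;
        int failed = 0;
        long totalNanos = 0;
        for (Path image : images) {
            long start = System.nanoTime();
            try {
                engine.extractText(image);
                succeeded++;
            } catch (Exception e) {
                // Only Java exceptions land here; a native JVM crash cannot be caught.
                failed++;
            }
            totalNanos += System.nanoTime() - start;
        }
        return new Result(name, succeeded, failed, totalNanos);
    }

    public static void main(String[] args) {
        List<Path> images = Arrays.asList(Paths.get("a.png"), Paths.get("b.png"));
        // Stub engines so the sketch runs without Tesseract installed.
        // Real adapters would call the two parsers here (assumption).
        OcrEngine current = image -> "text from " + image;
        OcrEngine tess4j = image -> "text from " + image;
        List<Result> results = Arrays.asList(
                run("TesseractOCRParser", current, images),
                run("Tess4jOCRParser", tess4j, images));
        for (Result r : results) {
            System.out.printf("%s: ok=%d failed=%d mean=%.3f ms%n",
                    r.name, r.succeeded, r.failed, r.meanMillis());
        }
    }
}
```

Counting failures separately from timings gives a rough stability figure next
to the performance one. Since a hard JVM crash would take the whole run down,
each parser would probably need to run in its own process for a fair
stability comparison.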

[1] https://issues.apache.org/jira/browse/TIKA-2293

On Tue, Mar 7, 2017 at 9:42 PM, Thamme Gowda <[email protected]> wrote:

> yes, we can try tika-eval to see the difference. Perfect!
>
> Best,
> TG
>
> On Mar 7, 2017 7:44 AM, "Allison, Timothy B." <[email protected]> wrote:
>
> Y and why not give the new tika-eval module a trial to evaluate the
> differences in output?  :)
>
> -----Original Message-----
> From: Thamme Gowda [mailto:[email protected]]
> Sent: Tuesday, March 7, 2017 10:38 AM
> To: Thejan Wijesinghe <[email protected]>
> Cc: [email protected]
> Subject: Re: Tess4j API for TIKA OCR parser
>
> Thanks Nick for the reply.
>
> Thejan,
>
> I am glad to know your progress. Rewriting the TesseractOCRParser would be
> the ultimate goal if using Tess4j proves to be better than the way it is
> done currently.
>
> But, for now, please consider these:
> + Rename your class to *Tess4jOCRParser*. It is a new parser providing the
>   same functionality as *TesseractOCRParser*.
> + Keep the *TesseractOCRParser* intact. You can use it as your reference to
>   understand features of OCR parser to support.
> + Benchmark *TesseractOCRParser* and *Tess4jOCRParser* with respect to
> performance and stability. You can take a set of 100 images and compare
> how much time each of them took. Please share those results here.
>
>
> Based on the benchmark, we can decide whether to replace the old one with
> the new one. Because TesseractOCRParser is used along with many other
> parsers (JPEG, PDF, etc.), any improvements you make with Tess4jOCRParser
> will have a huge effect!
>
> P.S.
> + Please don't edit any test cases. You may add new ones, though!
> + Could you please create a Jira issue to track this? Sorry, I should have
> said this earlier.
>
> Best,
> TG
>
>
> On Tue, Mar 7, 2017 at 4:58 AM, Thejan Wijesinghe <
> [email protected]> wrote:
>
> > Hi Nick,
> >
> > I thought the same thing. I will try to keep the public method
> > signatures unchanged and will send updates on my progress.
> >
> > On Tue, Mar 7, 2017 at 5:48 PM, Nick Burch <[email protected]> wrote:
> >
> > > On Tue, 7 Mar 2017, Thejan Wijesinghe wrote:
> > >
> > >> I have already used the Tess4j API to rewrite the TesseractOCRParser
> > >> class. Although it successfully extracts content from most of the file
> > >> types, it fails some particular unit tests in the TesseractOCRParserTest
> > >> class. I can solve that. However, I want to know whether I can
> > >> rewrite the entire TesseractOCRParser class from the ground up, but
> > >> if I do that there will be many broken links in the internals of
> > >> TIKA because, as I witnessed, most of the classes use the
> > >> TesseractOCRParser class indirectly.
> > >>
> > >
> > > If you can, try to keep the public methods unchanged. That way,
> > > other callers to the class will be unaffected by your re-write of
> > > the internal logic
> > >
> > > Nick
> > >
> >
>
>
>
