Re: Tika 2.0: Restructuring Tesseract

Chris Mattmann Thu, 25 Aug 2016 21:16:06 -0700

I like simple – I vote for option 1 ☺



On 8/25/16, 9:06 PM, "Bob Paulin" <[email protected]> wrote:

    Hi,
    
    I've been looking at some of the recent work with Tesseract and it's 
    really cool to be able to get OCR combined with so many parsers.  The bad 
    part is it has really coupled the multimedia and pdf modules together.  
    I see a couple of ways forward.
    
    The first and simplest is combining multimedia and pdf into a single 
    module.  This is already essentially what we're doing by adding it as a 
    hard dependency to the pdf modules and embedding it in the bundle.  We 
    lose some granularity with the modules but we're no longer embedding 
    modules inside of modules.
    
    The second option would be to restructure how we're calling Tesseract.  
    Currently we've embedded Tesseract into the PDF module to render PDFs as 
    images and run them through OCR.  We've done the opposite with the 
    ImageParser, TiffParser, and JpegParser by embedding each of these three 
    parsers into the TesseractParser. We could put a reference to the 
    Tesseract parser inside the ImageParser, TiffParser, and JpegParser with 
    an EmbeddedContentHandler (so the Tesseract data would be the embedded 
    content).  Then we could use a ParserProxy to use any of those three 
    image-related parsers within PDF if multimedia is included on the 
    classpath or otherwise wired up by OSGi.
    
    Thoughts on either of these approaches?  Other suggestions?
    
    
    - Bob
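
For illustration, the proxy idea in the second option could look roughly like the sketch below. It is not Tika code: SimpleParser, FakeOcrImageParser, and OptionalParserProxy are hypothetical names made up for this example, and Tika's actual Parser interface and ParserProxy class may look quite different. The only point is that the PDF side holds a class name rather than a compile-time dependency, and quietly does nothing when the image parser is not on the classpath.

```java
/** Small demo: one proxy whose target is present, one whose target is missing. */
class ProxyDemo {
    public static void main(String[] args) {
        ClassLoader cl = ProxyDemo.class.getClassLoader();
        SimpleParser present =
                new OptionalParserProxy(FakeOcrImageParser.class.getName(), cl);
        SimpleParser missing =
                new OptionalParserProxy("no.such.ImageParser", cl);
        System.out.println(present.parse("page-1.png"));
        System.out.println(missing.parse("page-1.png").isEmpty());
    }
}

/** Hypothetical, simplified parser contract (assumption: NOT Tika's Parser). */
interface SimpleParser {
    String parse(String input);
}

/** Stand-in for an OCR-capable image parser living in an optional module. */
class FakeOcrImageParser implements SimpleParser {
    public FakeOcrImageParser() {
    }

    @Override
    public String parse(String input) {
        return "ocr(" + input + ")";
    }
}

/**
 * Hypothetical proxy: looks up its delegate by class name when constructed.
 * If the class cannot be loaded or instantiated, the proxy becomes a no-op,
 * so the caller never needs a hard dependency on the optional module.
 */
class OptionalParserProxy implements SimpleParser {
    private final SimpleParser delegate; // null when the target is unavailable

    OptionalParserProxy(String className, ClassLoader loader) {
        this.delegate = load(className, loader);
    }

    private static SimpleParser load(String className, ClassLoader loader) {
        try {
            Class<?> clazz = Class.forName(className, true, loader);
            return (SimpleParser) clazz.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException | LinkageError | ClassCastException e) {
            // Missing class, unusable constructor, or wrong type: treat as absent.
            return null;
        }
    }

    boolean isAvailable() {
        return delegate != null;
    }

    @Override
    public String parse(String input) {
        // Return empty content rather than failing when the module is absent.
        return delegate == null ? "" : delegate.parse(input);
    }
}
```

In an OSGi deployment the lookup would more likely go through the service registry than through Class.forName, but the shape of the fallback would be similar.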
