Re: PDF with embedded attachments and Tika 2.0 modularity

Bob Paulin Thu, 15 Sep 2016 10:06:44 -0700

Hi Sergey,

I definitely get the challenges. In fact, we recently merged the PDF module into the Multimedia module due to the tight coupling around TesseractOCR [1] [2]. We could look into separating the PDF parser out again, but I'm a bit short on a simple way to do it with TesseractOCR in play. Like Tim, I'm hesitant to change the structure, but we definitely need to address how we handle embedded parsers. I've done some work with the ParserProxy class to remove some of the hard dependencies between parsers. With that we only pull in the parsers available on the classpath. There's an example in the JackcessExtractor class in the office module.
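
To make the "only pull in what is on the classpath" idea concrete, here is a rough, self-contained sketch of the general pattern. It is not the actual ParserProxy code: the names OptionalDelegate, TextHandler and Upper are made up for illustration, and TextHandler is only a stand-in for a real parser interface. The point is that the dependency is named by a string and resolved reflectively, so a missing class degrades to "not available" instead of a hard link error.

```java
import java.util.Optional;
import java.util.function.Function;

/** Illustrative only: wraps an optional class that may or may not be on the classpath. */
public class OptionalDelegate {

    /** Hypothetical stand-in for a parser interface; not Tika's Parser. */
    public interface TextHandler extends Function<String, String> {}

    /** Null when the named class could not be loaded or instantiated. */
    private final TextHandler delegate;

    public OptionalDelegate(String className, ClassLoader loader) {
        TextHandler found = null;
        try {
            // Resolve by name so there is no compile-time dependency on the class.
            Class<?> cls = Class.forName(className, true, loader);
            found = (TextHandler) cls.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException | ClassCastException | LinkageError e) {
            // Class missing, not instantiable, or of the wrong type: treat as unavailable.
        }
        this.delegate = found;
    }

    public boolean isAvailable() {
        return delegate != null;
    }

    /** Returns the delegate's result, or empty when the delegate is not available. */
    public Optional<String> handle(String input) {
        return delegate == null ? Optional.empty() : Optional.of(delegate.apply(input));
    }

    /** Tiny sample implementation so the sketch can be exercised on its own. */
    public static class Upper implements TextHandler {
        @Override
        public String apply(String s) {
            return s.toUpperCase();
        }
    }
}
```

Constructing it with OptionalDelegate.Upper.class.getName() gives a working delegate, while a name such as "no.such.Parser" gives one where isAvailable() is false and handle() returns an empty Optional. An embedded-document handler built this way can skip formats whose parser module was left out of the build.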

What is the motivation behind excluding the other parsers in your use case? Smaller footprint? Incompatibility? Performance?

Depending on the driver, there may be other ways to get you to a similar place.


Smaller footprint

You could just include the modules you need, and any embedded parsers from other modules could be added via a ParserProxy. This might not remove all the parsers you don't need, but it might be a good start. The most trimmed-down way is what you've provided below in your example: creating a tika-parser-pdf-module-all. I'm concerned about the number of combinations we might end up creating.


Incompatibility

You might want to look at the tika-parser-bundle projects, since putting the modules in an OSGi container will allow you to isolate the classloaders.


Performance

A combination of the above, or you might look to include a tika-config.xml and just exclude the parsers you don't want. That should prevent them from being a part of your pipeline.


Other ideas on this?  I think it's an important thing to discuss.


- Bob

[1] http://markmail.org/message/e4ncuid7zrvlitp5

[2] https://issues.apache.org/jira/browse/TIKA-2059


On 9/15/2016 11:20 AM, Allison, Timothy B. wrote:

Sergey, your point is well taken.

Y, you'd need most parsers, but you can _probably_ live without advanced or 
scientific (sorry, Chris!).

I'd be hesitant to change the structure much.  We should definitely document 
this well, though!

-----Original Message-----
From: Sergey Beryozkin [mailto:[email protected]]
Sent: Thursday, September 15, 2016 12:15 PM
To: [email protected]
Subject: PDF with embedded attachments and Tika 2.0 modularity

Hi All

As Tim educated me, PDF (and indeed other formats) may have all sorts of 
embedded attachments.

In my demo I've been working with Tika 2.0-SNAPSHOT which offers a nice option for 
users to pick up only individual parsers. So I've added PDFParser & 
OpenDocumentParser and tika-core to the project dependencies and all works very 
nicely when I submit a simple PDF to the demo.

But if I were to write the code which can handle the embedded attachments 
really well then I think I'll probably need to revert to depending on all of 
tika-parsers - otherwise how would I know which additional parser modules I 
should add ? If this reasoning is right then one can only use individual 
modules in the production if it is well-known the files to be processed will 
have no unexpected formats embedded in them...

I've been wondering - would it make sense, for Tika 2.0, to add a few more 'helper' 
modules for the most used formats, which would offer less than tika-parsers but 
more than individual modules, for example:

this is what 2.x already has:


tika-parser-modules/
    tika-parser-pdf-module
    (individual parser modules for the most used ones)

tika-parsers
(all of the parsers)

and now add:

tika-parser-pdf-module-all
(or similarly named)

this

tika-parser-pdf-module-all

will depend on tika-parser-pdf-module plus the parsers which will be needed to 
process various PDF attachments? This list of the extra deps will be based on 
the accumulated knowledge. Similarly for a few other most used formats.

tika-parser-pdf-module-all will be a 'compromise': it will pull in more modules 
than tika-parser-pdf-module but significantly fewer than tika-parsers.


Cheers, Sergey
