Hi Sergey,

I definitely get the challenges. In fact, we recently merged the PDF module into the Multimedia module because of the tight coupling around TesseractOCR [1] [2]. We could look into separating the PDF parser out again, but I don't see a simple way to do it with TesseractOCR in play. Like Tim, I'm hesitant to change the structure, but we definitely need to address how we handle embedded parsers. I've done some work with the ParserProxy class to remove some of the hard dependencies between parsers. With that, we only pull in the parsers that are available on the classpath. There is an example in the JackcessExtractor class in the office module.
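
To make the proxy idea concrete, here is a minimal, self-contained sketch of the general pattern: look the delegate up by class name at runtime and quietly do nothing if it is not on the classpath. This is not the actual ParserProxy code. TextExtractor, ExtractorProxy, UpperCaseExtractor and com.example.NotOnClasspath are hypothetical names made up for the illustration.

```java
import java.util.Locale;

// Sketch only: these types are hypothetical stand-ins, not Tika's Parser or ParserProxy.
class ParserProxySketch {

    /** Hypothetical stand-in for a parser interface. */
    public interface TextExtractor {
        String extract(String input);
    }

    /** Delegates to a class looked up by name; acts as a no-op if that class is absent. */
    public static final class ExtractorProxy implements TextExtractor {
        private final TextExtractor delegate; // null when the class could not be loaded

        public ExtractorProxy(String className, ClassLoader loader) {
            TextExtractor found = null;
            try {
                Class<?> cls = Class.forName(className, true, loader);
                found = (TextExtractor) cls.getDeclaredConstructor().newInstance();
            } catch (ReflectiveOperationException | ClassCastException | LinkageError e) {
                // Optional dependency is missing or unusable: stay a no-op.
            }
            this.delegate = found;
        }

        public boolean isAvailable() {
            return delegate != null;
        }

        @Override
        public String extract(String input) {
            return delegate == null ? "" : delegate.extract(input);
        }
    }

    /** A delegate that is present on the classpath in this demo. */
    public static final class UpperCaseExtractor implements TextExtractor {
        public UpperCaseExtractor() { }

        @Override
        public String extract(String input) {
            return input.toUpperCase(Locale.ROOT);
        }
    }

    public static void main(String[] args) {
        ClassLoader loader = ParserProxySketch.class.getClassLoader();
        ExtractorProxy present = new ExtractorProxy(UpperCaseExtractor.class.getName(), loader);
        ExtractorProxy missing = new ExtractorProxy("com.example.NotOnClasspath", loader);
        System.out.println(present.isAvailable() + " " + present.extract("pdf"));
        System.out.println(missing.isAvailable() + " [" + missing.extract("pdf") + "]");
    }
}
```

The point of the pattern is that the calling module has no compile-time dependency on the delegate, so leaving a module out degrades to "embedded content skipped" rather than a linkage error.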

What is the motivation behind excluding the other parsers in your use case? Smaller footprint? Incompatibility? Performance?

Depending on the driver, there may be other ways to get you to a similar place.

Smaller footprint

You could include just the modules you need, and any embedded parsers from other modules could be added via a ParserProxy. This might not remove all the parsers you don't need, but it might be a good start. The most trimmed-down way is what you've proposed in your example below: creating a tika-parser-pdf-module-all. I'm concerned about the number of combinations we might end up creating.

Incompatibility

You might want to look at the tika-parser-bundle projects, since putting the modules in an OSGi container will allow you to isolate the classloaders.

Performance

A combination of the above, or you might include a tika-config.xml and just exclude the parsers you don't want. That should prevent them from being part of your pipeline.
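
As a rough illustration, a tika-config.xml along these lines keeps the default parser set but leaves one parser out. This is a sketch that assumes the parser-exclude element from the Tika 1.x configuration format still applies to the version you are on. ExecutableParser is only an example of a parser to drop, not a recommendation.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Start from the default set of parsers found on the classpath -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <!-- Example exclusion only; list the parsers you actually want to drop -->
      <parser-exclude class="org.apache.tika.parser.executable.ExecutableParser"/>
    </parser>
  </parsers>
</properties>
```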

Other ideas on this? I think it's an important thing to discuss.


- Bob

[1] http://markmail.org/message/e4ncuid7zrvlitp5

[2] https://issues.apache.org/jira/browse/TIKA-2059


On 9/15/2016 11:20 AM, Allison, Timothy B. wrote:
Sergey, your point is well taken.

Y, you'd need most parsers, but you can _probably_ live without advanced or 
scientific (sorry, Chris!).

I'd be hesitant to change the structure much.  We should definitely document 
this well, though!

-----Original Message-----
From: Sergey Beryozkin [mailto:[email protected]]
Sent: Thursday, September 15, 2016 12:15 PM
To: [email protected]
Subject: PDF with embedded attachments and Tika 2.0 modularity

Hi All

As Tim educated me, PDF (and indeed other formats) may have all sorts of 
embedded attachments.

In my demo I've been working with Tika 2.0-SNAPSHOT which offers a nice option for 
users to pick up only individual parsers. So I've added PDFParser & 
OpenDocumentParser and tika-core to the project dependencies and all works very 
nice when I submit to the demo a simple PDF.

But if I were to write the code which can handle the embedded attachments 
really well then I think I'll probably need to revert to depending on all of 
tika-parsers - otherwise how would I know which additional parser modules I 
should add ? If this reasoning is right then one can only use individual 
modules in the production if it is well-known the files to be processed will 
have no unexpected formats embedded in them...

I've been wondering - would it make sense, for Tika 2.0, to add a few more 'helper' 
modules for most used formats, which would offer less than tika-parsers but 
more than individual modules, for example:

this is what 2.x already has:


tika-parser-modules/
    tika-parser-pdf-module
    (individual parser modules for the most used ones)

tika-parsers
(all of the parsers)

and now add:

tika-parser-pdf-module-all
(or similarly named)

this

tika-parser-pdf-module-all

will depend on tika-parser-pdf-module plus the parsers which will be needed to 
process various PDF attachments ? This list of the extra deps will be based on 
the accumulated knowledge. Similarly for a few other most-used formats.

tika-parser-pdf-module-all will be a 'compromise', it will pull more modules 
than tika-parser-pdf-module but significantly less than tika-parsers


Cheers, Sergey


