PDF with embedded attachments and Tika 2.0 modularity

Sergey Beryozkin Thu, 15 Sep 2016 09:16:00 -0700

Hi All

As Tim educated me, PDF (and indeed other formats) may have all sorts of embedded attachments.

In my demo I've been working with Tika 2.0-SNAPSHOT, which offers a nice option for users to pick up only individual parsers. So I've added PDFParser & OpenDocumentParser and tika-core to the project dependencies, and all works very nicely when I submit a simple PDF to the demo.

But if I were to write code which can handle the embedded attachments really well, then I think I'll probably need to revert to depending on all of tika-parsers - otherwise how would I know which additional parser modules I should add? If this reasoning is right, then one can only use individual modules in production if it is known in advance that the files to be processed will have no unexpected formats embedded in them...

I've been wondering - would it make sense, for Tika 2.0, to add a few more 'helper' modules for the most used formats, which would offer less than tika-parsers but more than the individual modules, for example:


this is what 2.x already has:


tika-parser-modules/
  tika-parser-pdf-module
  (individual parser modules for the most used ones)

tika-parsers
(all of the parsers)

and now add:

tika-parser-pdf-module-all
(or similarly named)

this

tika-parser-pdf-module-all

will depend on tika-parser-pdf-module plus the parsers which will be needed to process various PDF attachments. This list of extra deps would be based on accumulated knowledge. Similarly for a few other most used formats.

tika-parser-pdf-module-all will be a 'compromise': it will pull in more modules than tika-parser-pdf-module but significantly fewer than tika-parsers.
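
To make the idea a bit more concrete, here is a minimal sketch of how the dependency list for such an '-all' module could be derived from the accumulated knowledge. It is plain Java and does not use the Tika API. Everything in it is an assumption made for illustration: the class name, the two lookup tables, and every module name starting with example- are hypothetical; only tika-parser-pdf-module comes from the layout above. The listed embedded types are placeholders, not real statistics.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class AggregateModuleSketch {

    // Hypothetical table: media type -> parser module that handles it.
    // Only tika-parser-pdf-module is a name from this thread.
    static final Map<String, String> MODULE_FOR_TYPE = Map.of(
            "application/pdf", "tika-parser-pdf-module",
            "image/jpeg", "example-image-module",
            "text/plain", "example-text-module",
            "application/zip", "example-package-module");

    // Hypothetical "accumulated knowledge": media types that have been
    // seen embedded inside a given container type.
    static final Map<String, List<String>> EMBEDDED_IN = Map.of(
            "application/pdf",
            List.of("image/jpeg", "text/plain", "application/pdf"));

    // Collects the modules needed for a container type and for everything
    // known to be embedded in it, following nested embeddings as well.
    static Set<String> aggregateModules(String containerType) {
        Set<String> modules = new TreeSet<>();
        Set<String> seen = new HashSet<>();
        Deque<String> pending = new ArrayDeque<>();
        pending.push(containerType);
        while (!pending.isEmpty()) {
            String type = pending.pop();
            if (!seen.add(type)) {
                continue; // already handled, e.g. a PDF embedded in a PDF
            }
            String module = MODULE_FOR_TYPE.get(type);
            if (module != null) {
                modules.add(module);
            }
            for (String embedded : EMBEDDED_IN.getOrDefault(type, List.of())) {
                pending.push(embedded);
            }
        }
        return modules;
    }

    public static void main(String[] args) {
        // Prints the modules a hypothetical tika-parser-pdf-module-all
        // would depend on, given the tables above.
        System.out.println(aggregateModules("application/pdf"));
    }
}
```

With these placeholder tables the result contains the PDF module plus the image and text modules, but not the package module, which is the 'more than one module, less than everything' outcome described above.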



Cheers, Sergey
