Hi All

As Tim educated me, PDF (and indeed other formats) may have all sorts of embedded attachments.

In my demo I've been working with Tika 2.0-SNAPSHOT, which offers a nice option for users to pick up only individual parsers. So I've added PDFParser & OpenDocumentParser and tika-core to the project dependencies, and it all works very well when I submit a simple PDF to the demo.

But if I were to write code which handles the embedded attachments really well, then I think I'd probably need to revert to depending on all of tika-parsers - otherwise how would I know which additional parser modules I should add? If this reasoning is right, then one can only use individual modules in production if it is known in advance that the files to be processed will have no unexpected formats embedded in them...

I've been wondering - would it make sense, for Tika 2.0, to add a few more 'helper' modules for the most used formats, which would offer less than tika-parsers but more than the individual modules? For example:

this is what 2.x already has:


tika-parser-modules/
  tika-parser-pdf-module
  (individual parser modules for the most used ones)

tika-parsers
(all of the parsers)

and now add:

tika-parser-pdf-module-all
(or similarly named)

This tika-parser-pdf-module-all would depend on tika-parser-pdf-module plus the parsers needed to process the various PDF attachments. The list of extra dependencies would be based on accumulated knowledge. Similar modules could be added for a few of the other most used formats.

tika-parser-pdf-module-all would be a 'compromise': it would pull in more modules than tika-parser-pdf-module but significantly fewer than tika-parsers.
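
To make the idea a bit more concrete, here is a rough sketch in plain Java (no Tika APIs involved) of how the contents of such an '-all' module could be derived from a table of "formats commonly embedded in X". Everything in it is an assumption made for illustration: the class name, the hints table, and the example-* module names are made-up placeholders; only tika-parser-pdf-module is a name taken from above.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: computes which modules an "-all" helper module would pull in.
final class AggregateModuleSketch {

    // Placeholder "accumulated knowledge": for each module, the modules usually
    // needed to parse content embedded in the files it handles.
    // The example-* names are invented for this sketch.
    static final Map<String, List<String>> EMBEDDED_HINTS = Map.of(
        "tika-parser-pdf-module",
            List.of("example-image-module", "example-office-module"),
        "example-office-module",
            List.of("example-image-module", "tika-parser-pdf-module")
    );

    // Breadth-first walk over the hints table, starting from the base module.
    // The result always contains the base module itself; cycles are harmless
    // because each module is expanded only the first time it is seen.
    static Set<String> resolveAll(String baseModule) {
        Set<String> resolved = new LinkedHashSet<>();
        Deque<String> pending = new ArrayDeque<>();
        pending.add(baseModule);
        while (!pending.isEmpty()) {
            String module = pending.poll();
            if (resolved.add(module)) {
                pending.addAll(EMBEDDED_HINTS.getOrDefault(module, List.of()));
            }
        }
        return resolved;
    }

    public static void main(String[] args) {
        // Prints the modules a hypothetical tika-parser-pdf-module-all would depend on.
        System.out.println(resolveAll("tika-parser-pdf-module"));
    }
}
```

In practice the result would simply be written down as the dependency list of the helper module; the point of the sketch is only that the list is the closure of "what tends to be embedded in what", which should stay well short of everything in tika-parsers.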


Cheers, Sergey

