Hi All
As Tim educated me, PDF (and indeed other formats) may have all sorts of
embedded attachments.
In my demo I've been working with Tika 2.0-SNAPSHOT, which offers a nice
option for users to pull in only the individual parsers they need. So I've
added PDFParser & OpenDocumentParser and tika-core to the project
dependencies, and it all works very nicely when I submit a simple PDF to
the demo.
But if I were to write code that handles the embedded attachments
really well, then I think I'd probably need to revert to depending on
all of tika-parsers - otherwise how would I know which additional parser
modules I should add ? If this reasoning is right, then one can use
individual modules in production only if it is known in advance that the
files to be processed will have no unexpected formats embedded in them...
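
To make the gap concrete, here is a minimal sketch in plain Java. It does
not use any Tika API; the class name EmbeddedCoverageSketch and the method
uncovered are made up for illustration, and the sets of media types are
assumed inputs (what the selected parser modules support, and what was
actually found embedded in a file).

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of Tika: reports embedded media types
// that none of the selected parser modules can handle.
final class EmbeddedCoverageSketch {

    // supportedTypes: media types covered by the modules on the classpath (assumed input).
    // embeddedTypes: media types found embedded in a document (assumed input).
    static Set<String> uncovered(Set<String> supportedTypes, Set<String> embeddedTypes) {
        Set<String> missing = new TreeSet<>(embeddedTypes);
        missing.removeAll(supportedTypes);
        return missing;
    }
}
```

With only the PDF and OpenDocument parsers selected, a PDF carrying an
embedded JPEG would leave image/jpeg in the returned set, and that can
only be discovered once such a file turns up.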
I've been wondering - would it make sense, for Tika 2.0, to add a few more
'helper' modules for the most used formats, which would offer less than
tika-parsers but more than the individual modules? For example,
this is what 2.x already has:

  tika-parser-modules/
    tika-parser-pdf-module
      (individual parser modules for the most used ones)
  tika-parsers
    (all of the parsers)

and now add:

  tika-parser-pdf-module-all
    (or similarly named)

This tika-parser-pdf-module-all would depend on tika-parser-pdf-module
plus the parsers needed to process the various kinds of PDF attachments.
The list of extra dependencies would be based on accumulated knowledge.
The same could be done for a few other most used formats.
tika-parser-pdf-module-all would be a 'compromise': it would pull in more
modules than tika-parser-pdf-module but significantly fewer than tika-parsers.
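
As a rough illustration of how such an '-all' module could be derived, the
sketch below models the idea in plain Java. Everything except the name
tika-parser-pdf-module is an assumption: the other module names and the
table of "what tends to be embedded in what" are placeholders for the
accumulated knowledge mentioned above, not real Tika metadata.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model of the proposal; not a Tika API.
final class ParserBundleSketch {

    // Assumed table: container module -> modules needed for its typical embedded content.
    // The image and office module names are illustrative placeholders.
    static final Map<String, List<String>> EMBEDDED_DEPS = Map.of(
            "tika-parser-pdf-module",
            List.of("tika-parser-image-module", "tika-parser-office-module"),
            "tika-parser-office-module",
            List.of("tika-parser-image-module"));

    // Collects the module itself plus everything reachable through EMBEDDED_DEPS,
    // which is what a "<module>-all" artifact would declare as dependencies.
    static Set<String> bundleFor(String module) {
        Set<String> bundle = new LinkedHashSet<>();
        Deque<String> pending = new ArrayDeque<>();
        pending.push(module);
        while (!pending.isEmpty()) {
            String next = pending.pop();
            if (bundle.add(next)) { // add returns false for already visited modules
                for (String dep : EMBEDDED_DEPS.getOrDefault(next, List.of())) {
                    pending.push(dep);
                }
            }
        }
        return bundle;
    }
}
```

Under these assumptions, bundleFor("tika-parser-pdf-module") yields the PDF
module plus the two placeholder modules: more than the single module, but
still a bounded list rather than every parser.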
Cheers, Sergey