Hi Sergey,
I definitely get the challenges. In fact, we recently merged the PDF
module into the Multimedia module due to the tight coupling around
TesseractOCR [1][2]. We could look into separating the PDF parser out
again, but I'm a bit short on a simple way to do it with TesseractOCR in
play. Like Tim, I'm hesitant to change the structure, but we definitely
need to address how we handle embedded parsers. I've done some work with
the ParserProxy class to remove some of the hard dependencies between
parsers. With that, we only pull in the parsers available on the class
path. There's an example in the JackcessExtractor class in the office
module.
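To illustrate the general idea of only pulling in what is on the class
path, here is a minimal, Tika-independent sketch. It is not the actual
ParserProxy code; the class name OptionalParserLoading, the method
availableOnClassPath and the com.example parser name are all hypothetical,
and java.lang.String merely stands in for a parser that happens to be
present.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Tika: it only illustrates checking
// by name whether optional classes are present, so that nothing has to
// link against them at compile time.
class OptionalParserLoading {

    /** Returns the class names from the input that the given loader can resolve. */
    public static List<String> availableOnClassPath(List<String> classNames,
                                                    ClassLoader loader) {
        List<String> found = new ArrayList<>();
        for (String name : classNames) {
            try {
                // initialize=false: we only test for presence and do not
                // run static initializers of the candidate class.
                Class.forName(name, false, loader);
                found.add(name);
            } catch (ClassNotFoundException | LinkageError e) {
                // Class (or one of its dependencies) is missing: skip it.
            }
        }
        return found;
    }

    public static void main(String[] args) {
        List<String> candidates = Arrays.asList(
                "java.lang.String",                          // stand-in for a present parser
                "com.example.parsers.HypotheticalPdfParser"  // assumed to be absent
        );
        ClassLoader loader = OptionalParserLoading.class.getClassLoader();
        System.out.println(availableOnClassPath(candidates, loader));
    }
}
```

A caller would then instantiate only the names that come back, so a
missing optional module degrades gracefully instead of failing at class
load time.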
What is the motivation behind excluding the other parsers in your
use case? Smaller footprint? Incompatibility? Performance?
Depending on the driver, there may be other ways to get you to a
similar place.
Smaller footprint
You could just include the modules you need, and any embedded parsers
from other modules could be added via a ParserProxy. This might not
remove all the parsers you don't need, but it might be a good start. The
most trimmed-down approach is the one in your example below: creating a
tika-parser-pdf-module-all. I'm concerned about the number
of combinations we might end up creating.
Incompatibility
You might want to look at the tika-parser-bundle projects, since putting
the modules in an OSGi container will allow you to isolate the classloaders.
Performance
A combination of the above or you might look to include a
tika-config.xml and just exclude the parsers you don't want. That
should prevent them from being a part of your pipeline.
Other ideas on this? I think it's an important thing to discuss.
- Bob
[1] http://markmail.org/message/e4ncuid7zrvlitp5
[2] https://issues.apache.org/jira/browse/TIKA-2059
On 9/15/2016 11:20 AM, Allison, Timothy B. wrote:
Sergey, your point is well taken.
Y, you'd need most parsers, but you can _probably_ live without advanced or
scientific (sorry, Chris!).
I'd be hesitant to change the structure much. We should definitely document
this well, though!
-----Original Message-----
From: Sergey Beryozkin [mailto:[email protected]]
Sent: Thursday, September 15, 2016 12:15 PM
To: [email protected]
Subject: PDF with embedded attachments and Tika 2.0 modularity
Hi All
As Tim educated me, PDF (and indeed other formats) may have all sorts of
embedded attachments.
In my demo I've been working with Tika 2.0-SNAPSHOT, which offers a nice option for
users to pick up only individual parsers. So I've added PDFParser &
OpenDocumentParser and tika-core to the project dependencies, and all works very
nicely when I submit a simple PDF to the demo.
But if I were to write code that can handle the embedded attachments
really well, then I think I'd probably need to revert to depending on all of
tika-parsers - otherwise how would I know which additional parser modules I
should add? If this reasoning is right, then one can only use individual
modules in production if it is well known that the files to be processed will
have no unexpected formats embedded in them...
I've been wondering - would it make sense, for Tika 2.0, to add a few more 'helper'
modules for the most used formats, which would offer less than tika-parsers but
more than individual modules, for example:
this is what 2.x already has:
tika-parser-modules/
tika-parser-pdf-module
(individual parser modules for the most used ones)
tika-parsers
(all of the parsers)
and now add:
tika-parser-pdf-module-all
(or similarly named)
this
tika-parser-pdf-module-all
will depend on tika-parser-pdf-module plus the parsers which will be needed to
process various PDF attachments? This list of the extra deps will be based on
the accumulated knowledge. Similarly for a few other most used formats.
tika-parser-pdf-module-all will be a 'compromise': it will pull in more modules
than tika-parser-pdf-module but significantly fewer than tika-parsers.
Cheers, Sergey