On 9/15/2016 3:33 PM, Sergey Beryozkin wrote:
Totally makes sense. I think you'll end up getting most of what you
need by just pulling in the tika-parser-multimedia-module. It's already
got all the image parsers for embedded images and TesseractOCR so you
can take your demo as far as reading all the images and converting some
of the images to text if you have Tesseract installed.
Hi Bob, Tim, All,
On 15/09/16 18:06, Bob Paulin wrote:
This is the one, it is not a big deal to have all of tika-parsers
included in my demo, but I've been curious how the smaller footprint
can indeed be achieved in Tika 2.x given it already does the best
effort at supporting more modular Tika applications...
I definitely get the challenges. In fact recently we merged the PDF
module into the Multimedia module due to the tight coupling around the
TesseractOCR. We could look into separating the PDF parser out
again but I'm a bit short on a simple way to do it with TesseractOCR in
play. Like Tim I'm hesitant to change structure but we definitely need
to address how we handle embedded parsers. I've done some work with the
ParserProxy class to remove some of the hard dependencies between
parsers. With that we only pull in the parsers available on the class
path. There is an example in the JackcessExtractor class in the office
What is the motivation behind excluding the other parsers in your
usecase? Smaller footprint? Incompatibility? Performance?
Depending on the driver there may be other ways to get you to a
Sorry I took you for a Tika veteran based on your concerns for embedded
parsers! The ParserProxy is new in 2.x and you would actually not need to
worry about it for coding your demo or a client application. It is more
for the framework, to allow the modules to compile without parsers from
other modules on the classpath. It pulls them in via reflection at
runtime or, if they are not present, falls back to a no-op.
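To illustrate the idea only (this is not the actual ParserProxy code from Tika 2.x), here is a minimal, self-contained sketch of "load via reflection at runtime, fall back to a no-op". The names ParserProxySketch, SimpleParser, UpperCaseParser and NO_OP are all hypothetical and exist just for this example; Tika's real Parser interface is richer than this.

```java
import java.lang.reflect.Constructor;
import java.util.Locale;

public class ParserProxySketch {

    /** Hypothetical, minimal parser contract used only in this sketch. */
    public interface SimpleParser {
        String parse(String input);
    }

    /** Stand-in for a parser that happens to be on the classpath. */
    public static class UpperCaseParser implements SimpleParser {
        public UpperCaseParser() {
        }

        @Override
        public String parse(String input) {
            return input.toUpperCase(Locale.ROOT);
        }
    }

    /** Fallback used when the requested parser class is not available. */
    public static final SimpleParser NO_OP = input -> "";

    /**
     * Tries to instantiate the named class reflectively. If the class is
     * missing, cannot be constructed, or is not a SimpleParser, the no-op
     * parser is returned instead, so callers never need a compile-time
     * dependency on the optional parser.
     */
    public static SimpleParser load(String className) {
        try {
            Class<?> clazz = Class.forName(className);
            Constructor<?> ctor = clazz.getDeclaredConstructor();
            Object instance = ctor.newInstance();
            if (instance instanceof SimpleParser) {
                return (SimpleParser) instance;
            }
            return NO_OP;
        } catch (ReflectiveOperationException | LinkageError e) {
            // Class not on the classpath or not instantiable: degrade quietly.
            return NO_OP;
        }
    }

    public static void main(String[] args) {
        SimpleParser present = load(UpperCaseParser.class.getName());
        SimpleParser missing = load("org.example.DoesNotExist");
        System.out.println("present: " + present.parse("embedded text"));
        System.out.println("missing is no-op: " + (missing == NO_OP));
    }
}
```

The calling code looks the same whether or not the optional parser is present, which is the property that lets a module compile without the other modules on its classpath.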
You could just include the modules you need and any embedded parsers
from other modules could be added via a ParserProxy. This might not
remove all the parsers you don't need but might be a good start.
I haven't heard of ParserProxy yet, sorry :-). As a Tika user I'm just
learning. How would one use ParserProxy to minimize the dependencies?
Sure, if such an option would ever be considered then I'd imagine
there would have to be a limit set. E.g., the 5 most widely used formats
which may have embedded attachments would have an extra module support
(core parser like PDF parser plus the support parsers for the embedded
The most trimmed down way is what you've provided below in your example
creating a tika-parser-pdf-module-all. I'm concerned about the number
of combinations we might end up creating.
I agree that a limit would be needed. Would it make sense to hold off on
including them in Tika for now and see if some popular combinations
emerge? Your demo is a great first step to get some feedback; I think
we need more in order to ensure we're making the correct combinations.
Yes, let's start with the multimedia module. I think you'll get quite a
bit of cool things within that. Tim, do you know of any other modules
that would make sense?
But I'm OK with selecting the individual parser modules that may be
needed to have a nearly complete PDF parsing coverage, as long as I
know which modules I have to select :-)
You might want to look at the tika-parser-bundle projects since putting
the modules in an OSGi container will allow you to isolate the
A combination of the above or you might look to include a
tika-config.xml and just exclude the parsers you don't want. That
should prevent them from being a part of your pipeline.
Other ideas on this? I think it's an important thing to discuss.
Many thanks, Sergey
Thank you for the feedback!
On 9/15/2016 11:20 AM, Allison, Timothy B. wrote:
Sergey, your point is well taken.
Y, you'd need most parsers, but you can _probably_ live without
advanced or scientific (sorry, Chris!).
I'd be hesitant to change the structure much. We should definitely
document this well, though!
From: Sergey Beryozkin [mailto:sberyoz...@gmail.com]
Sent: Thursday, September 15, 2016 12:15 PM
Subject: PDF with embedded attachments and Tika 2.0 modularity
As Tim educated me, PDF (and indeed other formats) may have all sorts
of embedded attachments.
In my demo I've been working with Tika 2.0-SNAPSHOT which offers a
nice option for users to pick up only individual parsers. So I've
added PDFParser & OpenDocumentParser and tika-core to the project
dependencies and all works very nicely when I submit to the demo a
But if I were to write the code which can handle the embedded
attachments really well then I think I'll probably need to revert to
depending on all of tika-parsers - otherwise how would I know which
additional parser modules I should add? If this reasoning is right
then one can only use individual modules in production if it is
well known that the files to be processed will have no unexpected formats
embedded in them...
I've been wondering - would it make sense, for Tika 2.0, to add a few more
'helper' modules for most used formats, which would offer less than
tika-parsers but more than individual modules, for example:
this is what 2.x already has:
(individual parser modules for the most used ones)
(all of the parsers)
and now add:
(or similarly named)
will depend on tika-parser-pdf-module plus the parsers which will be
needed to process various PDF attachments? This list of the extra
deps will be based on the accumulated knowledge. Similarly for a few
other most used formats.
tika-parser-pdf-module-all will be a 'compromise': it will pull in more
modules than tika-parser-pdf-module but significantly less than