Re: PDF with embedded attachments and Tika 2.0 modularity

Bob Paulin Fri, 16 Sep 2016 12:49:50 -0700

Hi Sergey,


On 9/15/2016 3:33 PM, Sergey Beryozkin wrote:

Hi Bob, Tim, All,
On 15/09/16 18:06, Bob Paulin wrote:

Hi Sergey,

I definitely get the challenges.  In fact recently we merged the PDF
module into the Multimedia module due to the tight coupling around the
TesseractOCR[1] [2].  We could look into separating the PDF parser out
again but I'm a bit short on a simple way to do it with TesseractOCR in
play.  Like Tim I'm hesitant to change structure but we definitely need
to address how we handle embedded parsers.  I've done some work with the
ParserProxy class to remove some of the hard dependencies between
parsers.  With that we only pull in the parsers available on the class
path.  There is an example in the JackcessExtractor class in the office
module.

What is the motivation behind excluding the other parsers in your
usecase?  Smaller footprint?  Incompatibility?  Performance?

Depending on the driver there may be other ways to get you to a
similar place.

Smaller footprint

This is the one, it is not a big deal to have all of tika-parsers
included in my demo, but I've been curious how the smaller footprint
can indeed be achieved in Tika 2.x given it already does the best
effort at supporting more modular Tika applications...

Totally makes sense. I think you'll end up getting most of what you
need by just pulling in the tika-parser-multimedia-module. It's already
got all the image parsers for embedded images and TesseractOCR so you
can take your demo as far as reading all the images and converting some
of the images to text if you have Tesseract installed.

You could just include the modules you need and any embedded parsers
from other modules could be added via a ParserProxy.  This might not
remove all the parsers you don't need but might be a good start.
I haven't heard of ParserProxy yet, sorry :-). As a Tika user I'm just
learning. How would one use ParserProxy to minimize the dependencies?
Just found
https://issues.apache.org/jira/browse/TIKA-1904

Sorry, I took you for a Tika veteran based on your concerns about
embedded parsers! The ParserProxy is new in 2.x, and you would actually
not need to worry about it when coding your demo or a client
application. It is more for the framework, to allow the modules to
compile without parsers from other modules on the classpath. It pulls
them in via reflection at runtime or, if they are not present, falls
back to a no-op.
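
For readers unfamiliar with the pattern, here is a minimal sketch of
"load by reflection, fall back to a no-op". It is not Tika's ParserProxy
code: the TextExtractor interface, the UpperCaseExtractor class and the
loadOrNoOp method are hypothetical names made up for illustration, and
only standard JDK reflection calls are used.

```java
public class OptionalParserDemo {

    /** Hypothetical stand-in for a parser interface; NOT Tika's Parser API. */
    public interface TextExtractor {
        String extract(String input);
    }

    /** Hypothetical implementation that would normally live in an optional module. */
    public static class UpperCaseExtractor implements TextExtractor {
        @Override
        public String extract(String input) {
            return input.toUpperCase();
        }
    }

    /** Fallback used when the requested class cannot be loaded: returns no text. */
    public static final TextExtractor NO_OP = input -> "";

    /**
     * Looks the class up by name at runtime, so the caller needs no
     * compile-time dependency on it. Any failure yields the no-op.
     */
    public static TextExtractor loadOrNoOp(String className) {
        try {
            Class<?> clazz = Class.forName(className);
            Object instance = clazz.getDeclaredConstructor().newInstance();
            if (instance instanceof TextExtractor) {
                return (TextExtractor) instance;
            }
            return NO_OP; // class exists but has the wrong type
        } catch (ReflectiveOperationException | LinkageError e) {
            return NO_OP; // class missing or not instantiable
        }
    }

    public static void main(String[] args) {
        // In real use the name would be a plain string constant; the class
        // literal is used here only so the demo resolves in any package.
        TextExtractor present = loadOrNoOp(UpperCaseExtractor.class.getName());
        TextExtractor missing = loadOrNoOp("com.example.NotOnClasspathParser");

        System.out.println(present.extract("pdf"));           // prints "PDF"
        System.out.println(missing.extract("pdf").isEmpty()); // prints "true"
    }
}
```

The calling module compiles either way; whether the embedded content is
actually handled depends only on what is on the classpath at runtime.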

The most trimmed down way is what you've provided below in your example
creating a tika-parser-pdf-module-all.  I'm concerned about the number
of combinations we might end up creating.
Sure, if such an option would ever be considered then I'd imagine
there would have to be a limit set. Ex, 5 most widely used formats
which may have embedded attachments would have an extra module support
(core parser like PDF parser plus the support parsers for the embedded
attachments).

I agree that a limit would be needed. Would it make sense to hold off on
including them in Tika for now and see if some popular combinations
emerge? Your demo is a great first step to get some feedback; I think
we need more in order to ensure we're making the correct combinations.

But I'm OK with selecting the individual parser modules that may be
needed to have a nearly complete PDF parsing coverage, as long as I
know which modules I have to select :-)

Yes, let's start with the multimedia module. I think you'll get quite a
bit of cool things within that. Tim, do you know of any other modules
that would make sense?

Incompatibility

You might want to look at the tika-parser-bundle projects, since putting
the modules in an OSGi container will allow you to isolate the
classloaders.


Performance

A combination of the above or you might look to include a
tika-config.xml and just exclude the parsers you don't want. That
should prevent them from being a part of your pipeline.
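
As a rough illustration of the exclusion approach, a tika-config.xml
along the following lines keeps the default parser set but drops one
parser. This is a sketch based on the parser-exclude element of the
Tika 1.x configuration format; it is an assumption that the 2.x snapshot
discussed here honors it the same way, and the excluded class is only an
example of a parser one might not want.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Start from everything DefaultParser discovers on the classpath -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <!-- Example exclusion; substitute the parsers you do not want -->
      <parser-exclude class="org.apache.tika.parser.executable.ExecutableParser"/>
    </parser>
  </parsers>
</properties>
```

Note that this only removes the parser from the processing pipeline; the
jars are still on the classpath, so it does not shrink the footprint.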

Other ideas on this?  I think it's an important thing to discuss.

Many thanks, Sergey

Thank you for the feedback!


- Bob

[1] http://markmail.org/message/e4ncuid7zrvlitp5

[2] https://issues.apache.org/jira/browse/TIKA-2059


On 9/15/2016 11:20 AM, Allison, Timothy B. wrote:

Sergey, your point is well taken.

Y, you'd need most parsers, but you can _probably_ live without
advanced or scientific (sorry, Chris!).

I'd be hesitant to change the structure much.  We should definitely
document this well, though!

-----Original Message-----
From: Sergey Beryozkin [mailto:[email protected]]
Sent: Thursday, September 15, 2016 12:15 PM
To: [email protected]
Subject: PDF with embedded attachments and Tika 2.0 modularity

Hi All

As Tim educated me, PDF (and indeed other formats) may have all sort
of embedded attachments.

In my demo I've been working with Tika 2.0-SNAPSHOT which offers a
nice option for users to pick up only individual parsers. So I've
added PDFParser & OpenDocumentParser and tika-core to the project
dependencies and all works very nice when I submit to the demo a
simple PDF.

But if I were to write the code which can handle the embedded
attachments really well then I think I'll probably need to revert to
depending on all of tika-parsers - otherwise how would I know which
additional parser modules I should add ? If this reasoning is right
then one can only use individual modules in the production if it is
well-known the files to be processed will have no unexpected formats
embedded in them...

I've been wondering - would it make sense, for Tika 2.0, to add a few more
'helper' modules for most used formats, which would offer less than
tika-parsers but more than individual modules, for example:

this is what 2.x already has:


tika-parser-modules/
    tika-parser-pdf-module
    (individual parser modules for the most used ones)

tika-parsers
(all of the parsers)

and now add:

tika-parser-pdf-module-all
(or similarly named)

this

tika-parser-pdf-module-all

will depend on tika-parser-pdf-module plus the parsers which will be
needed to process various PDF attachments ? This list of the extra
deps will be based on the accumulated knowledge. Similarly for few
other most used formats

tika-parser-pdf-module-all will be a 'compromise', it will pull more
modules than tika-parser-pdf-module but significantly less than
tika-parsers


Cheers, Sergey
