Re: PDF with embedded attachments and Tika 2.0 modularity

Sergey Beryozkin Fri, 16 Sep 2016 14:13:35 -0700

Hi Bob

Thanks for all the info, much appreciated. I agree it makes sense to start with the multimedia bundle to increase the format coverage. I'll keep experimenting. The demo is quite basic as far as Tika is concerned - I've only tried PDF/ODT/ODP files and it looks like they are really simple files. But I'd like the simple Tika code I already have to keep working even when it is given an ODP with some video or a PDF with some image, etc. I have a few PDFs in mind - we'll try them.

I've been with Tika for a while now but my contributions were mostly limited to TikaJAXRS. Together with my colleague I did the initial Tika CXF integration (which got improved a bit after some feedback from Tim).

But as a Tika API user I'm an enthusiastic beginner :-)

Thanks for your help

Sergey
On 16/09/16 20:49, Bob Paulin wrote:

Hi Sergey,


On 9/15/2016 3:33 PM, Sergey Beryozkin wrote:

Hi Bob, Tim, All,
On 15/09/16 18:06, Bob Paulin wrote:

Hi Sergey,

I definitely get the challenges.  In fact recently we merged the PDF
module into the Multimedia module due to the tight coupling around the
TesseractOCR[1] [2].  We could look into separating the PDF parser out
again but I'm a bit short on a simple way to do it with TesseractOCR in
play.  Like Tim I'm hesitant to change structure but we definitely need
to address how we handle embedded parsers.  I've done some work with the
ParserProxy class to remove some of the hard dependencies between
parsers.  With that we only pull in the parsers available on the class
path.  There is an example in the JackcessExtractor class in the office
module.

What is the motivation behind excluding the other parsers in your
usecase?  Smaller footprint?  Incompatibility?  Performance?

Depending on the driver there may be other ways to get you to a
similar place.

Smaller footprint

This is the one, it is not a big deal to have all of tika-parsers
included in my demo, but I've been curious how the smaller footprint
can indeed be achieved in Tika 2.x given it already does the best
effort at supporting more modular Tika applications...

Totally makes sense.  I think you'll end up getting most of what you
need by just pulling in the tika-parser-multimedia-module.  It's already
got all the image parsers for embedded images and TesseractOCR so you
can take your demo as far as reading all the images and converting some
of the images to text if you have Tesseract installed.

You could just include the modules you need and any embedded parsers
from other modules could be added via a ParserProxy.  This might not
remove all the parsers you don't need but might be a good start.


I haven't heard of ParserProxy yet, sorry :-). As a Tika user I'm just
learning. How would one use ParserProxy to minimize the dependencies ?
Just found
https://issues.apache.org/jira/browse/TIKA-1904

Sorry I took you for a Tika veteran based on your concerns for embedded
parsers!  The ParserProxy is new in 2.x and you would actually not need to
worry about it for coding your demo or a client application.  It is more
for the framework, to allow the modules to compile without parsers from
other modules on the classpath.  It pulls them in via reflection at
runtime or, if they are not present, falls back to a no-op.
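
To make the idea concrete, here is a minimal sketch of that "load via
reflection, else no-op" pattern.  It is not the actual ParserProxy code
from TIKA-1904: the class name ReflectiveProxySketch, the TextExtractor
interface and the UpperCaseExtractor class are all made up for
illustration, and Tika's real Parser interface is richer than this.

```java
// Hypothetical sketch only; this is not Tika's ParserProxy implementation.
class ReflectiveProxySketch {

    /** Stand-in for a parser interface (assumption, not a Tika type). */
    interface TextExtractor {
        String extract(String input);
    }

    /** An "optional" implementation that may or may not be on the classpath. */
    public static class UpperCaseExtractor implements TextExtractor {
        public UpperCaseExtractor() { }

        @Override
        public String extract(String input) {
            return input.toUpperCase();
        }
    }

    /** Returns an instance of className if it can be loaded, otherwise a no-op. */
    static TextExtractor load(String className) {
        try {
            Class<?> type = Class.forName(className);
            Object instance = type.getDeclaredConstructor().newInstance();
            if (instance instanceof TextExtractor) {
                return (TextExtractor) instance;
            }
        } catch (ReflectiveOperationException | LinkageError e) {
            // Class is missing or cannot be loaded: fall through to the no-op.
        }
        // No-op fallback: extracts nothing instead of failing.
        return input -> "";
    }

    public static void main(String[] args) {
        TextExtractor present = load(UpperCaseExtractor.class.getName());
        TextExtractor missing = load("org.example.NotOnTheClasspath");
        System.out.println(present.extract("pdf"));
        System.out.println("[" + missing.extract("pdf") + "]");
    }
}
```

The caller never references the optional class at compile time, only its
name as a string, which is why the calling module can be built without
the other module on the classpath.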

The most trimmed down way is what you've provided below in your example
creating a tika-parser-pdf-module-all.  I'm concerned about the number
of combinations we might end up creating.

Sure, if such an option would ever be considered then I'd imagine
there would have to be a limit set. Ex, 5 most widely used formats
which may have embedded attachments would have an extra module support
(core parser like PDF parser plus the support parsers for the embedded
attachments).


I agree that a limit would be needed.  Would it make sense to hold off on
including them in Tika for now and see if some popular combinations
emerge?  Your demo is a great first step to get some feedback; I think
we need more in order to ensure we're making the correct combinations.


But I'm OK with selecting the individual parser modules that may be
needed to have a nearly complete PDF parsing coverage, as long as I
know which modules I have to select :-)

Yes, let's start with the multimedia module.  I think you'll get quite a
bit of cool things within that.  Tim do you know of any other modules
that would make sense?

Incompatibility

You might want to look at the tika-parser-bundle projects since putting
the modules in an OSGi container will allow you to isolate the
classloaders.

Performance

A combination of the above or you might look to include a
tika-config.xml and just exclude the parsers you don't want. That
should prevent them from being a part of your pipeline.

Other ideas on this?  I think it's an important thing to discuss.

Many thanks, Sergey

Thank you for the feedback!


- Bob

[1] http://markmail.org/message/e4ncuid7zrvlitp5

[2] https://issues.apache.org/jira/browse/TIKA-2059


On 9/15/2016 11:20 AM, Allison, Timothy B. wrote:

Sergey, your point is well taken.

Y, you'd need most parsers, but you can _probably_ live without
advanced or scientific (sorry, Chris!).

I'd be hesitant to change the structure much.  We should definitely
document this well, though!

-----Original Message-----
From: Sergey Beryozkin [mailto:[email protected]]
Sent: Thursday, September 15, 2016 12:15 PM
To: [email protected]
Subject: PDF with embedded attachments and Tika 2.0 modularity

Hi All

As Tim educated me, PDF (and indeed other formats) may have all sorts
of embedded attachments.

In my demo I've been working with Tika 2.0-SNAPSHOT which offers a
nice option for users to pick up only individual parsers. So I've
added PDFParser & OpenDocumentParser and tika-core to the project
dependencies and all works very nicely when I submit a simple PDF to
the demo.

But if I were to write the code which can handle the embedded
attachments really well then I think I'll probably need to revert to
depending on all of tika-parsers - otherwise how would I know which
additional parser modules I should add ? If this reasoning is right
then one can only use individual modules in production if it is
well known that the files to be processed will have no unexpected
formats embedded in them...

I've been wondering - would it make sense, for Tika 2.0, to add a few
more 'helper' modules for the most used formats, which would offer less
than tika-parsers but more than individual modules, for example:

this is what 2.x already has:


tika-parser-modules/
    tika-parser-pdf-module
    (individual parser modules for the most used ones)

tika-parsers
(all of the parsers)

and now add:

tika-parser-pdf-module-all
(or similarly named)

this

tika-parser-pdf-module-all

will depend on tika-parser-pdf-module plus the parsers which will be
needed to process various PDF attachments ? This list of the extra
deps will be based on the accumulated knowledge. Similarly for a few
other most used formats

tika-parser-pdf-module-all will be a 'compromise', it will pull more
modules than tika-parser-pdf-module but significantly less than
tika-parsers


Cheers, Sergey



--
Sergey Beryozkin

Talend Community Coders
http://coders.talend.com/
