Hi Sergey,

On 9/15/2016 3:33 PM, Sergey Beryozkin wrote:
Hi Bob, Tim, All,
On 15/09/16 18:06, Bob Paulin wrote:
Hi Sergey,

I definitely get the challenges.  In fact recently we merged the PDF
module into the Multimedia module due to the tight coupling around the
TesseractOCR[1] [2].  We could look into separating the PDF parser out
again but I'm a bit short on a simple way to do it with TesseractOCR in
play.  Like Tim I'm hesitant to change structure but we definitely need
to address how we handle embedded parsers.  I've done some work with the
ParserProxy class to remove some of the hard dependencies between
parsers.  With that we only pull in the parsers available on the class
path.  There's an example in the JackcessExtractor class in the office

What is the motivation behind excluding the other parsers in your
usecase?  Smaller footprint?  Incompatibility?  Performance?

Depending on the driver there may be other ways to get you to a
similar place.

Smaller footprint

This is the one. It is not a big deal to have all of tika-parsers included in my demo, but I've been curious how a smaller footprint can actually be achieved in Tika 2.x, given that it already makes a best effort to support more modular Tika applications...
Totally makes sense. I think you'll end up getting most of what you need by just pulling in the tika-parser-multimedia-module. It's already got all the image parsers for embedded images and TesseractOCR so you can take your demo as far as reading all the images and converting some of the images to text if you have Tesseract installed.

You could just include the modules you need and any embedded parsers
from other modules could be added via a ParserProxy.  This might not
remove all the parsers you don't need but might be a good start.

I haven't heard of ParserProxy yet, sorry :-). As a Tika user I'm just learning. How would one use ParserProxy to minimize the dependencies ?
Just found

Sorry, I took you for a Tika veteran based on your concerns about embedded parsers! The ParserProxy is new in 2.x, and you would not actually need to worry about it when coding your demo or a client application. It's more for the framework, to allow the modules to compile without parsers from other modules on the classpath. It pulls them in via reflection at runtime or, if they are not present, falls back to a no-op.
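
To make the reflection-plus-fallback idea above concrete, here is a minimal sketch in plain Java. It is not Tika's actual ParserProxy code: the TextExtractor interface and the ReflectiveProxy, UpperCaseExtractor and ProxyDemo classes are all hypothetical names invented for illustration. It only shows the general pattern of looking a class up by name at runtime and substituting a no-op when it is missing.

```java
/** Hypothetical stand-in for a parser interface; NOT Tika's Parser API. */
interface TextExtractor {
    String extract(String input);
}

/** Hypothetical implementation that may or may not be on the classpath. */
class UpperCaseExtractor implements TextExtractor {
    @Override
    public String extract(String input) {
        return input.toUpperCase();
    }
}

/** Loads an extractor by class name via reflection; uses a no-op if it is absent. */
final class ReflectiveProxy implements TextExtractor {
    private final TextExtractor delegate;
    private final boolean available;

    ReflectiveProxy(String className) {
        TextExtractor found = null;
        try {
            // No compile-time reference to the target class is needed here.
            Object candidate = Class.forName(className)
                    .getDeclaredConstructor()
                    .newInstance();
            if (candidate instanceof TextExtractor) {
                found = (TextExtractor) candidate;
            }
        } catch (ReflectiveOperationException | LinkageError e) {
            // Class missing or not instantiable: fall through to the no-op.
        }
        available = (found != null);
        delegate = available ? found : input -> "";
    }

    boolean isAvailable() {
        return available;
    }

    @Override
    public String extract(String input) {
        return delegate.extract(input);
    }
}

class ProxyDemo {
    public static void main(String[] args) {
        TextExtractor present = new ReflectiveProxy(UpperCaseExtractor.class.getName());
        // Assumed-absent class name, chosen only for the demonstration.
        TextExtractor missing = new ReflectiveProxy("com.example.MissingExtractor");
        System.out.println("[" + present.extract("pdf") + "]");
        System.out.println("[" + missing.extract("pdf") + "]");
    }
}
```

The point of the pattern is that the calling module compiles without the optional class on its classpath and degrades quietly when it is absent at runtime.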

The most trimmed down way is what you've provided below in your example
creating a tika-parser-pdf-module-all.  I'm concerned about the number
of combinations we might end up creating.

Sure, if such an option were ever considered then I'd imagine a limit would have to be set. E.g., the 5 most widely used formats which may have embedded attachments would get an extra module (a core parser like the PDF parser plus the supporting parsers for the embedded attachments).

I agree that a limit would be needed. Would it make sense to hold off on including them in Tika for now and see if some popular combinations emerge? Your demo is a great first step to get some feedback; I think we need more in order to ensure we're making the correct combinations.

But I'm OK with selecting the individual parser modules that may be needed to have a nearly complete PDF parsing coverage, as long as I know which modules I have to select :-)

Yes, let's start with the multimedia module. I think you'll get quite a bit of cool things within that. Tim, do you know of any other modules that would make sense?

You might want to look at the tika-parser-bundle projects since putting
the modules in an OSGi container will allow you to isolate the classloaders.


A combination of the above or you might look to include a
tika-config.xml and just exclude the parsers you don't want. That
should prevent them from being a part of your pipeline.
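
As a rough illustration of the tika-config.xml approach, a configuration along these lines excludes a parser from the default composite parser. This is only a sketch: the excluded class here (ExecutableParser) is picked as an example, and the exact element names and class names available should be checked against the Tika version in use.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Use the default set of parsers, minus the ones excluded below -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <!-- Example exclusion only; list the parsers you do not want -->
      <parser-exclude class="org.apache.tika.parser.executable.ExecutableParser"/>
    </parser>
  </parsers>
</properties>
```

Note that this keeps the excluded parsers out of the pipeline but does not by itself shrink the dependency footprint, since the jars are still on the classpath.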

Other ideas on this?  I think it's an important thing to discuss.

Many thanks, Sergey

Thank you for the feedback!

- Bob

[1] http://markmail.org/message/e4ncuid7zrvlitp5

[2] https://issues.apache.org/jira/browse/TIKA-2059

On 9/15/2016 11:20 AM, Allison, Timothy B. wrote:
Sergey, your point is well taken.

Yes, you'd need most parsers, but you can _probably_ live without
advanced or scientific (sorry, Chris!).

I'd be hesitant to change the structure much.  We should definitely
document this well, though!

-----Original Message-----
From: Sergey Beryozkin [mailto:sberyoz...@gmail.com]
Sent: Thursday, September 15, 2016 12:15 PM
To: dev@tika.apache.org
Subject: PDF with embedded attachments and Tika 2.0 modularity

Hi All

As Tim educated me, PDF (and indeed other formats) may have all sorts
of embedded attachments.

In my demo I've been working with Tika 2.0-SNAPSHOT which offers a
nice option for users to pick up only individual parsers. So I've
added PDFParser & OpenDocumentParser and tika-core to the project
dependencies and all works very nice when I submit to the demo a
simple PDF.

But if I were to write the code which can handle the embedded
attachments really well then I think I'll probably need to revert to
depending on all of tika-parsers - otherwise how would I know which
additional parser modules I should add ? If this reasoning is right
then one can only use individual modules in production if it is
well-known the files to be processed will have no unexpected formats
embedded in them...

I've been wondering - would it make sense, for Tika 2.0, to add a few more
'helper' modules for most used formats, which would offer less than
tika-parsers but more than individual modules, for example:

this is what 2.x already has:

    (individual parser modules for the most used ones)

(all of the parsers)

and now add:

(or similarly named)



will depend on tika-parser-pdf-module plus the parsers which will be
needed to process various PDF attachments ? This list of the extra
deps will be based on the accumulated knowledge. Similarly for a few
other most used formats.

tika-parser-pdf-module-all will be a 'compromise', it will pull more
modules than tika-parser-pdf-module but significantly less than

Cheers, Sergey
