Re: Subsets of tika parsers redux

Nick Burch Tue, 25 Nov 2014 15:12:07 -0800

On Mon, 24 Nov 2014, Sergey Beryozkin wrote:

It is an interesting idea, one that can lead to introducing finer-grained bundles but also providing a mechanism for the (auto-)generation of the import metadata required by each of the parser modules. Besides, introducing several smaller bundles that would group most popular formats is a good one on its own IMHO.
My doubt here is how many of those bundles we'd need to create and if it will make it easy for users to get a task like "Get a parser for the format A only, or parsers A and B formats only" done.

My hunch, though I've not yet checked if it'll work properly, would be for something like half a dozen parsers:

 * pdf
 * office
 * audio / video
 * html
 * xml-based (xml, odf, epub, atom)
 * scientific
 * everything else

Are we talking about introducing a parser module per every supported format, and having tika-parsers depend on all of those modules, with every parser module becoming a bundle (a jar plus an entry in the manifest)?

As a first step, I thought we'd still keep the same tika-parser jar; the only difference would be which dependencies ended up in the bundle. If the tika-bundle-pdf has no POI jars included in it, then the Microsoft Office related parsers shouldn't register themselves.

It would mean that the "pdf bundle" would still have the image, microsoft etc parser code in it, but those parsers wouldn't be registered as their dependencies wouldn't be there.
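
To illustrate the idea (a rough sketch only, not Tika's actual parser loading code): a parser could be treated as usable only when a class from its third-party dependency can be found on the classpath. The class name OptionalParserRegistry, its methods, and the choice of marker classes below are all hypothetical, picked just for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: decides which parsers are usable based on
// whether a marker class from each parser's dependency is present.
final class OptionalParserRegistry {

    // Returns true if the named class can be located by our class loader.
    static boolean isClassPresent(String className) {
        try {
            // initialize=false: only locate the class, don't run its static initialisers
            Class.forName(className, false, OptionalParserRegistry.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Dependency jar is not in this bundle / classpath
            return false;
        }
    }

    // Keeps only the parsers whose required dependency class is available.
    // Keys are parser names, values are a marker class from the dependency.
    static List<String> usableParsers(Map<String, String> parserToRequiredClass) {
        List<String> usable = new ArrayList<>();
        for (Map.Entry<String, String> entry : parserToRequiredClass.entrySet()) {
            if (isClassPresent(entry.getValue())) {
                usable.add(entry.getKey());
            }
        }
        return usable;
    }

    public static void main(String[] args) {
        // Marker classes chosen for illustration (assumption): one from PDFBox, one from POI
        Map<String, String> parsers = new LinkedHashMap<>();
        parsers.put("pdf", "org.apache.pdfbox.pdmodel.PDDocument");
        parsers.put("office", "org.apache.poi.poifs.filesystem.POIFSFileSystem");
        // Prints only the parsers whose dependencies are on the classpath
        System.out.println(usableParsers(parsers));
    }
}
```

Run inside a cut-down bundle that ships PDFBox but not POI, a check like this would leave "pdf" in the list and drop "office", even though the office parser code itself is still in the jar.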

Not sure if this can/will work, but it would mean we can do cut-down bundles + cut-down-maven-docs, without needing to change anything else. If it proves popular, we can then re-visit the "giant tika parsers" question, but if not it shouldn't change anything. Well, that's the theory... :)


Nick
