On Mon, 24 Nov 2014, Sergey Beryozkin wrote:
> It is an interesting idea, one that can lead to introducing finer-grained bundles but also to providing a mechanism for the (auto-)generation of the import metadata required by each of the parser modules. Besides, introducing several smaller bundles that group the most popular formats is a good idea in its own right IMHO.

> My doubt here is how many of those bundles we'd need to create, and whether it will make it easy for users to get a task like "Get a parser for format A only, or parsers for formats A and B only" done.

My hunch, though I've not yet checked whether it'll work properly, would be for something like half a dozen parser bundles:
 * pdf
 * office
 * audio / video
 * html
 * xml-based (xml, odf, epub, atom)
 * scientific
 * everything else

> Are we talking about introducing a parser module for every supported format, and having tika-parsers depend on all of those modules, with every parser module becoming a bundle (a jar plus an entry in the manifest)?

As a first step, I thought we'd still keep the same tika-parsers jar; the only difference would be which dependencies end up in the bundle. If tika-bundle-pdf has no POI jars included in it, then the Microsoft Office related parsers shouldn't register themselves.

It would mean that the "pdf bundle" would still have the image, Microsoft etc. parser code in it, but those parsers wouldn't be registered because their dependencies wouldn't be there.
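
To make the "don't register if the dependency is missing" idea concrete, here is a rough sketch in plain Java. It is not how Tika actually loads or registers parsers; the class ConditionalParserRegistry, its methods and the "pdf" / "office" labels are made up for illustration. The only assumption about real libraries is that PDFBox and POI ship the two marker classes named in main.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Tika: decides which parser groups could
// be registered based on whether a class from their dependency is loadable.
class ConditionalParserRegistry {

    // Returns true if the named class can be found by our class loader.
    // The class is located but not initialised (second argument is false).
    static boolean isClassAvailable(String className) {
        try {
            Class.forName(className, false,
                    ConditionalParserRegistry.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Dependency jar (or one of its own dependencies) is missing.
            return false;
        }
    }

    // Keeps only the parser groups whose marker class is present,
    // preserving the iteration order of the input map.
    static List<String> availableParsers(Map<String, String> markerClassByParser) {
        List<String> available = new ArrayList<>();
        for (Map.Entry<String, String> entry : markerClassByParser.entrySet()) {
            if (isClassAvailable(entry.getValue())) {
                available.add(entry.getKey());
            }
        }
        return available;
    }

    public static void main(String[] args) {
        // Illustrative grouping: parser group -> one class from its dependency.
        Map<String, String> markers = new LinkedHashMap<>();
        markers.put("pdf", "org.apache.pdfbox.pdmodel.PDDocument");
        markers.put("office", "org.apache.poi.poifs.filesystem.POIFSFileSystem");

        // In a bundle that embeds PDFBox but not POI, only "pdf" would be listed.
        System.out.println("Parsers that would register: " + availableParsers(markers));
    }
}
```

What this prints depends entirely on what is on the classpath, which is the point: the same parser code ships everywhere, and the set of embedded dependency jars decides what is usable.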

Not sure if this can/will work, but it would mean we can do cut-down bundles + cut-down-maven-docs without needing to change anything else. If it proves popular, we can then revisit the "giant tika parsers" question, but if not, it shouldn't change anything. Well, that's the theory... :)

Nick
