On Mon, 24 Nov 2014, Sergey Beryozkin wrote:
It is an interesting idea, one that can lead to introducing finer-grained
bundles but also providing a mechanism for the (auto-)generation of the
import metadata required by each of the parser modules. Besides, introducing
several smaller bundles that group the most popular formats is a good idea
in its own right, IMHO.
My doubt here is how many of those bundles we'd need to create, and whether
it will make it easy for users to get a task like "Get a parser for format A
only, or parsers for formats A and B only" done.
My hunch, though I've not yet checked whether it'll work properly, would be
for something like half a dozen parser bundles:
* pdf
* office
* audio / video
* html
* xml-based (xml, odf, epub, atom)
* scientific
* everything else
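As a rough illustration of how such a grouping could answer the "parser for
format A only" question, here is a minimal Java sketch that maps a format to
the bundle that would carry it. Only the name tika-bundle-pdf appears in this
thread; every other bundle name, the format keys, and the class itself are
hypothetical placeholders, not existing Tika artifacts.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical lookup from a format name to the bundle proposed for it.
final class BundleLookupSketch {

    // Assumed name for the catch-all "everything else" bundle.
    static final String FALLBACK = "tika-bundle-other";

    // Builds the format -> bundle table; all names except
    // tika-bundle-pdf are made up for illustration.
    static Map<String, String> formatToBundle() {
        Map<String, String> table = new LinkedHashMap<>();
        table.put("pdf", "tika-bundle-pdf");
        table.put("html", "tika-bundle-html");
        for (String xmlBased : Arrays.asList("xml", "odf", "epub", "atom")) {
            table.put(xmlBased, "tika-bundle-xml");
        }
        return table;
    }

    // Returns the distinct bundles needed to cover the requested formats,
    // falling back to the catch-all bundle for anything not listed.
    static Set<String> bundlesFor(List<String> formats) {
        Map<String, String> table = formatToBundle();
        Set<String> bundles = new LinkedHashSet<>();
        for (String format : formats) {
            bundles.add(table.getOrDefault(format, FALLBACK));
        }
        return bundles;
    }

    public static void main(String[] args) {
        // Formats A and B only: two bundles instead of the full parser set.
        System.out.println(bundlesFor(Arrays.asList("pdf", "epub")));
    }
}
```

With a table like this, a user asking for two formats gets at most two
bundles, and formats that share a group (say odf and epub) collapse to one.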
Are we talking about introducing a parser module for every supported
format, and having tika-parsers depend on all of those modules, with
every parser module becoming a bundle (a jar plus an entry in the
manifest)?
As a first step, I thought we'd still keep the same tika-parsers jar; the
only difference would be which dependencies end up in the bundle. If
tika-bundle-pdf has no POI jars included in it, then the Microsoft Office
related parsers shouldn't register themselves.
It would mean that the "pdf bundle" would still have the image, Microsoft
etc. parser code in it, but those parsers wouldn't be registered because
their dependencies wouldn't be there.
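To make the "don't register if the dependency is missing" idea concrete,
here is a minimal Java sketch. It is not how Tika actually registers parsers;
the class, the parser labels, and the choice of one "marker" class per
dependency are all assumptions made for illustration. It only shows the
general technique of probing the classpath before enabling a parser.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: enable a parser only when a class from its
// third-party dependency can be found on the classpath.
final class OptionalParserRegistry {

    // True if the named class can be located, without initialising it.
    static boolean isPresent(String className) {
        try {
            Class.forName(className, false,
                    OptionalParserRegistry.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    // Keeps only the parsers whose marker dependency class is present.
    static List<String> availableParsers(Map<String, String> parserToMarker) {
        List<String> available = new ArrayList<>();
        for (Map.Entry<String, String> entry : parserToMarker.entrySet()) {
            if (isPresent(entry.getValue())) {
                available.add(entry.getKey());
            }
        }
        return available;
    }

    public static void main(String[] args) {
        Map<String, String> candidates = new LinkedHashMap<>();
        // Parser labels are illustrative; marker classes are assumed to be
        // representative of the PDFBox and POI jars respectively.
        candidates.put("pdf", "org.apache.pdfbox.pdmodel.PDDocument");
        candidates.put("office",
                "org.apache.poi.poifs.filesystem.POIFSFileSystem");
        // The result depends on which jars are on the classpath.
        System.out.println(availableParsers(candidates));
    }
}
```

In a bundle that ships PDFBox but not POI, a check along these lines would
leave the Office parser code in the jar but unregistered, which is the
behaviour described above.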
Not sure if this can/will work, but it would mean we can do cut-down
bundles + cut-down-maven-docs, without needing to change anything else. If
it proves popular, we can then re-visit the "giant tika parsers" question,
but if not it shouldn't change anything. Well, that's the theory... :)
Nick