Hi Nick

Sorry I haven't responded earlier. Please see my comment below.

On 25/11/14 23:11, Nick Burch wrote:
On Mon, 24 Nov 2014, Sergey Beryozkin wrote:
It is an interesting idea, one that can lead to introducing
finer-grained bundles but also providing a mechanism for the
(auto-)generation of the import metadata required by each of the
parser modules. Besides, introducing several smaller bundles that
would group the most popular formats is a good idea in its own right IMHO.

My doubt here is how many of those bundles we'd need to create and whether
it will make it easy for users to get a task like "Get a parser for
format A only, or parsers for formats A and B only" done.

My hunch, though I've not yet checked if it'll work properly, would be for
something like half a dozen parsers:
  * pdf
  * office
  * audio / video
  * html
  * xml-based (xml, odf, epub, atom)
  * scientific
  * everything else

Are we talking about introducing a parser module for every supported
format, and having tika-parsers depend on all of those modules, with
every parser module becoming a bundle (a jar plus an entry in the
manifest)?

As a first step, I thought we'd still keep the same tika-parser jar, the
only difference would be what dependencies ended up in the bundle. If
the tika-bundle-pdf has no POI jars included in it, then the Microsoft
Office related parsers shouldn't register themselves.

It would mean that the "pdf bundle" would have the image, Microsoft, etc.
parser code in it, but those parsers wouldn't be registered as their
dependencies wouldn't be there.
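
A minimal sketch of that registration idea, in Java. This is not how
tika-parsers does it today; the class OptionalParserRegistry and its
methods are hypothetical names made up for illustration, and the two
probed class names are only examples of a class that each third-party
library would supply. The assumption is that a parser is skipped when a
class from its underlying library cannot be loaded.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Tika: registers a parser only when a
// class from the library it depends on can be found by the class loader.
class OptionalParserRegistry {

    // Returns true if the named class can be located without initializing it.
    static boolean isAvailable(String className, ClassLoader loader) {
        try {
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // The dependency jar is missing (or incomplete) in this bundle.
            return false;
        }
    }

    // candidates maps a parser name to a class its dependency must provide.
    // Only parsers whose probe class is loadable end up in the result.
    static List<String> registerAvailable(Map<String, String> candidates,
                                          ClassLoader loader) {
        List<String> registered = new ArrayList<>();
        for (Map.Entry<String, String> entry : candidates.entrySet()) {
            if (isAvailable(entry.getValue(), loader)) {
                registered.add(entry.getKey());
            }
        }
        return registered;
    }

    public static void main(String[] args) {
        Map<String, String> candidates = new LinkedHashMap<>();
        // Probe classes chosen for illustration only.
        candidates.put("pdf", "org.apache.pdfbox.pdmodel.PDDocument");
        candidates.put("office", "org.apache.poi.POIDocument");

        ClassLoader loader = OptionalParserRegistry.class.getClassLoader();
        // Prints whichever parser names have their dependency on the classpath.
        System.out.println(registerAvailable(candidates, loader));
    }
}
```

In a cut-down "pdf bundle" built this way, the office entry would simply
drop out of the list because no POI classes are present, while the parser
code itself could stay in the same jar.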

Not sure if this can/will work, but it would mean we can do cut-down
bundles + cut-down-maven-docs, without needing to change anything else.
If it proves popular, we can then re-visit the "giant tika parsers"
question, but if not it shouldn't change anything. Well, that's the
theory... :)


Sorry if I haven't completely understood the idea. I think there's
definitely something nice being suggested above, and it sounds to me as
if the following could be one possible realization of it, as a first step
for example:

- add a tika-pdf module; this will be a bundle, so it will work both as a
  jar and as an OSGi bundle; the code for tika-pdf will be extracted (and
  removed) from tika-parsers
- tika-parsers will get updated to depend on tika-pdf, hence users working
  with tika-parsers won't be affected
- users who want to work with PDF only would add just the tika-core and
  tika-pdf dependencies
- once we see it works, we repeat the process for a few more mainstream
  formats as you suggested above (XHTML, audio+video, etc), with
  tika-parsers being gradually minimized but still playing the role of the
  everything-else container

Do you see the above as being at least somewhat consistent with what you suggested? Would it work?

Cheers, Sergey

Nick

