Re: Subsets of tika parsers redux

Sergey Beryozkin Mon, 15 Dec 2014 02:49:44 -0800

Hi Nick

Sorry for not responding earlier. Please see my comment below.


On 25/11/14 23:11, Nick Burch wrote:

On Mon, 24 Nov 2014, Sergey Beryozkin wrote:

It is an interesting idea, one that can lead to introducing
finer-grained bundles but also providing a mechanism for the
(auto-)generation of the import metadata required by each of the
parser modules. Besides, introducing several smaller bundles that
would group the most popular formats is a good idea on its own IMHO.

My doubt here is how many of those bundles we'd need to create, and
whether it would make it easy for users to get a task like "Get a parser
for format A only, or parsers for formats A and B only" done.


My hunch, though I've not yet checked if it'll work properly, would be for
something like half a dozen parsers:
  * pdf
  * office
  * audio / video
  * html
  * xml-based (xml, odf, epub, atom)
  * scientific
  * everything else

Are we talking about introducing a parser module per every supported
format, and having tika-parsers depend on all of those modules, with
every parser module becoming a bundle (a jar plus an entry in the
manifest) ?


As a first step, I thought we'd still keep the same tika-parsers jar; the
only difference would be which dependencies end up in the bundle. If
the tika-bundle-pdf has no POI jars included in it, then the Microsoft
Office related parsers shouldn't register themselves.

It would mean that the "pdf bundle" would still have the image, Microsoft,
etc. parser code in it, but those parsers wouldn't be registered as their
dependencies wouldn't be there.

Not sure if this can/will work, but it would mean we can do cut-down
bundles + cut-down-maven-docs, without needing to change anything else.
If it proves popular, we can then re-visit the "giant tika parsers"
question, but if not it shouldn't change anything. Well, that's the
theory... :)
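
A minimal sketch of the "parsers shouldn't register themselves when
their dependencies are missing" idea, assuming a simple classpath probe.
The OptionalParserRegistry class, its method names, and the mapping from
bundle name to a required class are all hypothetical and only for
illustration; this is not how Tika actually registers its parsers.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Tika: decides which parser groups to
// register based on whether a class from their dependency can be found.
public class OptionalParserRegistry {

    // Returns true if the named class can be located by the given loader.
    // The class is looked up without being initialized.
    public static boolean isAvailable(String className, ClassLoader loader) {
        try {
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    // Keeps only the parser groups whose required class is present.
    public static List<String> registerAvailable(
            Map<String, String> requirements, ClassLoader loader) {
        List<String> registered = new ArrayList<>();
        for (Map.Entry<String, String> entry : requirements.entrySet()) {
            if (isAvailable(entry.getValue(), loader)) {
                registered.add(entry.getKey());
            }
        }
        return registered;
    }

    public static void main(String[] args) {
        // Assumed mapping: parser group -> one class from its dependency.
        Map<String, String> requirements = new LinkedHashMap<>();
        requirements.put("pdf", "org.apache.pdfbox.pdmodel.PDDocument");
        requirements.put("office",
                "org.apache.poi.poifs.filesystem.POIFSFileSystem");

        ClassLoader loader = OptionalParserRegistry.class.getClassLoader();
        // Prints only the groups whose dependencies are on the classpath.
        System.out.println(registerAvailable(requirements, loader));
    }
}
```

With only the PDFBox jars in a bundle, a probe like this would leave the
"office" group unregistered even though its parser code is still present
in the jar.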

Sorry if I haven't completely understood the idea. I think there's
definitely something nice being suggested above, and it sounds to me as
if the following could be one possible realization of it, as a first step
for example:

- add a tika-pdf module; this will be a bundle, so it will work both as a
  jar and as an OSGi bundle; the code for tika-pdf will be extracted (and
  removed) from tika-parsers
- tika-parsers will get updated to depend on tika-pdf, hence users
  working with tika-parsers won't be affected
- those users who want to work with PDF only would add the tika-core +
  tika-pdf dependencies only
- once we see it works, we repeat the process for a few more mainstream
  formats as you suggested above (XHTML, audio+video, etc.), with
  tika-parsers being gradually minimized but still playing the role of the
  everything-else container

Do you see the above being at least somewhat consistent with what you
suggested? Would it work?


Cheers, Sergey

Nick
