Hi Nick,
On 15/12/14 14:02, Nick Burch wrote:
On Mon, 15 Dec 2014, Sergey Beryozkin wrote:
As a first step, I thought we'd still keep the same tika-parser jar, the
only difference would be what dependencies ended up in the bundle. If
the tika-bundle-pdf has no POI jars included in it, then the Microsoft
Office related parsers shouldn't register themselves.

It would mean that the "pdf bundle" would have the image, microsoft etc
parser code in them, but the parsers wouldn't be registered as their
dependencies wouldn't be there.
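
To make the "parsers don't register if their dependencies are missing" idea concrete, here is a rough Java sketch. It is not Tika's actual registration mechanism; the class name OptionalParserRegistry, the parser labels and the dependency class names are all made up for illustration. It only shows the general technique of probing for a dependency class before registering anything that needs it.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: register a parser only when a class from its
// third-party dependency can be loaded. Not Tika's real code.
public class OptionalParserRegistry {

    /** Returns true if the named class can be loaded (without initializing it). */
    static boolean isAvailable(String className, ClassLoader loader) {
        try {
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Dependency jar is absent or incomplete: treat as unavailable.
            return false;
        }
    }

    /**
     * Takes a map of parser label -> marker class of its dependency and
     * returns the labels whose dependency is present, in insertion order.
     */
    static List<String> registerAvailable(Map<String, String> parserToDependency,
                                          ClassLoader loader) {
        List<String> registered = new ArrayList<>();
        for (Map.Entry<String, String> entry : parserToDependency.entrySet()) {
            if (isAvailable(entry.getValue(), loader)) {
                registered.add(entry.getKey());
            }
        }
        return registered;
    }

    public static void main(String[] args) {
        Map<String, String> candidates = new LinkedHashMap<>();
        // java.lang.String stands in for a PDF library class that is present.
        candidates.put("pdf", "java.lang.String");
        // Made-up class name standing in for a POI class that is not in the bundle.
        candidates.put("office", "example.missing.PoiStandIn");

        ClassLoader loader = OptionalParserRegistry.class.getClassLoader();
        System.out.println(registerAvailable(candidates, loader)); // prints [pdf]
    }
}
```

In a cut-down bundle the parser code would still be inside the jar, but only the parsers whose libraries were actually packaged would end up registered.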

Not sure if this can/will work, but it would mean we can do cut-down
bundles + cut-down-maven-docs, without needing to change anything else.
If it proves popular, we can then re-visit the "giant tika parsers"
question, but if not it shouldn't change anything. Well, that's the
theory... :)


Sorry if I haven't completely understood the idea, I think there's
definitely something nice being suggested above, and it sounds to me
as if the following can be one possible realization of it, as a first
step for example,
- add a tika-pdf module, this will be a bundle, so it will work as a
jar and as an OSGI bundle; the code for tika-pdf will be extracted
(and removed) from tika-parsers

Not quite - I foresee this being OSGi only for now. Tika Parsers project
would be unchanged, OSGi users could have tika (all) as now, or just
tika-pdf

- tika-parsers will get updated to depend on tika-pdf - hence users
working with tika-parsers won't be affected

No, that's a possible phase 2 if it goes well. No change for non-OSGi
stuff. Non-OSGi users can see the OSGi build to work out what to include
and exclude if they want. (This means that we have a unit tested way to
see what you do/don't want, without affecting things for the simple Tika
users who already get confused by tika-core + tika-parsers)

- those users who want to work with PDF only would add tika-core +
tika-pdf dependencies only

OSGi users would pick tika + tika-parsers, or tika + tika-parsers-pdf,
or tika + tika-parsers-pdf + tika-parsers-mp3 if they want


OSGi is nicely contained, and fairly easy to unit test, so let's use
that to test out the idea! That also solves the CXF need. Once that
works, and once we have a tested way that everyone can see + understand,
then someone can try to make the case for phase II where we push it to
the maven pom / project level!
The need of CXF (Tika) users (or of some other users with possibly similar requirements) is not about shipping OSGi-only Tika modules but about having an easy option of not having to include all of tika-parsers. Some CXF users would work with OSGi, some not. Sorry if I did not clarify it.

As I said, a module marked as "bundle", as opposed to a default 'jar', is just a plain jar with a few extra META-INF instructions.
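
To illustrate the point, here is a small Java sketch that tells a plain jar from an OSGi bundle purely by looking at the manifest. Bundle-SymbolicName is the standard OSGi header; the class name BundleManifestCheck and the symbolic name example.tika.pdf are invented for the example, and the manifests are built from strings rather than read from real Tika jars.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.jar.Manifest;

// Hypothetical sketch: an OSGi bundle is an ordinary jar whose
// META-INF/MANIFEST.MF carries a few extra headers.
public class BundleManifestCheck {

    /** Parses manifest text (as it would appear in META-INF/MANIFEST.MF). */
    static Manifest parse(String text) {
        try {
            return new Manifest(
                new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** A jar counts as a bundle here if it declares Bundle-SymbolicName. */
    static boolean isOsgiBundle(Manifest manifest) {
        return manifest.getMainAttributes().getValue("Bundle-SymbolicName") != null;
    }

    public static void main(String[] args) {
        Manifest plainJar = parse("Manifest-Version: 1.0\n");
        Manifest bundle = parse(
            "Manifest-Version: 1.0\n"
            + "Bundle-ManifestVersion: 2\n"
            + "Bundle-SymbolicName: example.tika.pdf\n" // made-up name
            + "Bundle-Version: 1.0.0\n");

        System.out.println(isOsgiBundle(plainJar)); // prints false
        System.out.println(isOsgiBundle(bundle));   // prints true
    }
}
```

A non-OSGi classpath simply ignores the extra headers, which is why the same artifact can serve both kinds of users.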

Given that, I don't understand why you are opposed to having tika-parsers minimized as I suggested. What exactly is your concern?

Shipping something like tika-pdf but still keeping the PDF parsing code inside tika-parsers is duplication, right?

Thanks, Sergey





Nick
