Hi,

At ApacheCon this year there was some discussion of the problems people have had with OSGi and Tika.

The TL;DR version of this email is that I think the pain with tika-bundle comes from its attempt to encompass all the parsers in one uber jar. I think breaking tika-bundle down into smaller jars (as Apache Camel does) might provide a more maintainable approach going forward for OSGi and non-OSGi developers alike.

I've put up a straw man implementation of smaller bundles here:

https://github.com/bobpaulin/tika
(specifically in https://github.com/bobpaulin/tika/tree/trunk/tika-parser-bundles)

For the impatient, please check out the code above. The approach has zero impact on any of the current Tika jar files, so nothing would have to change for current users, but they could choose to move to the new approach if it better fits their needs. It's a straw man, so feel free to beat it up :).

For those who have time for an explanation...
The straw man implementation does the following:

1) Copies and inlines the tika-core.jar in the tika-osgi-bundle. (see pom.xml)

2) The tika-osgi-bundle registers a Tika Service that contains a Default Parser that is composed of all Tika Parsers registered as OSGi services as well as available Detectors. This extends the functionality in the tika-core TikaActivator class.

3) Copies and inlines classes from the tika-parser project into specific bundles (see the individual pom.xml files, specifically the maven-bundle-plugin configuration).

4) Each bundle registers services for each parser and provides configuration for ranking (see individual Activator.java). This could expand considerably if we wanted to provide additional properties to toggle features on and off.
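To make steps 2 and 4 a bit more concrete, here is a minimal plain-Java sketch of the idea: parsers are registered with a ranking, and a composite lookup picks the highest-ranked parser that supports a given media type. This is only a model under stated assumptions, not the code in the repository above. Every name in it (ParserBundlesSketch, SketchParser, RankedParserRegistry) is hypothetical, and it deliberately avoids the OSGi and Tika APIs so it runs on its own. In a real bundle the OSGi service registry would play the registry role, with ranking typically expressed through the standard service.ranking service property.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Plain-Java model only: none of these types are Tika or OSGi classes.
class ParserBundlesSketch {

    // Hypothetical stand-in for a parser contributed by one bundle.
    public interface SketchParser {
        String name();
        Set<String> supportedTypes();
    }

    // Models the registry that a composite parser would consult.
    public static class RankedParserRegistry {
        private static final class Entry {
            final SketchParser parser;
            final int ranking;

            Entry(SketchParser parser, int ranking) {
                this.parser = parser;
                this.ranking = ranking;
            }
        }

        private final List<Entry> entries = new ArrayList<>();

        // Called when a parser bundle starts (assumed lifecycle).
        public void register(SketchParser parser, int ranking) {
            entries.add(new Entry(parser, ranking));
        }

        // Called when a parser bundle stops (assumed lifecycle).
        public void unregister(SketchParser parser) {
            entries.removeIf(e -> e.parser == parser);
        }

        // Highest-ranked registered parser that supports the media type, if any.
        public Optional<SketchParser> parserFor(String mediaType) {
            return entries.stream()
                    .filter(e -> e.parser.supportedTypes().contains(mediaType))
                    .max(Comparator.comparingInt((Entry e) -> e.ranking))
                    .map(e -> e.parser);
        }
    }

    // Helper that builds a parser from a name and its supported media types.
    public static SketchParser parser(String name, String... types) {
        Set<String> supported = new HashSet<>(Arrays.asList(types));
        return new SketchParser() {
            public String name() { return name; }
            public Set<String> supportedTypes() { return supported; }
        };
    }

    public static void main(String[] args) {
        RankedParserRegistry registry = new RankedParserRegistry();
        SketchParser image = parser("image", "image/png", "image/jpeg");
        SketchParser ocr = parser("ocr", "image/png");
        registry.register(image, 0);
        registry.register(ocr, 10);

        // Both support image/png; the higher ranking wins.
        System.out.println(registry.parserFor("image/png")
                .map(SketchParser::name).orElse("none")); // prints "ocr"

        // Removing the OCR "bundle" falls back to the image parser.
        registry.unregister(ocr);
        System.out.println(registry.parserFor("image/png")
                .map(SketchParser::name).orElse("none")); // prints "image"
    }
}
```

The point of the sketch is that adding or removing a parser bundle only changes what is registered; the composite lookup itself does not need to know which bundles exist.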

This provides the following advantages to users and developers in both an OSGi and standard deployment:

1) These bundles could evolve separately. The community could decide how fine-grained the bundles should be. For example, I bundled the image and jpeg packages together but put Tesseract OCR in a separate bundle.

2) Users can select just the parsers they want to include in their projects.

3) Each parser project could maintain its own optional and embedded bundle dependencies. Currently the tika-bundle project has a marathon of these entries, with optional and embedded dependencies clocking in at ~200 lines and ~20 lines respectively.

If you made it this far, thanks for reading :). As an OSGi developer I think the implementation provides a better experience, and it could make things better for non-OSGi developers as well. Best of all, it works without requiring changes to any of the existing released jars. Let me know what you think.


- Bob Paulin
