[DISCUSS] A more modular parser project

Bob Paulin Sat, 25 Jul 2015 18:07:06 -0700

Hi,

At ApacheCon this year there was some discussion of some woes with OSGi and Tika.

The TL;DR version of this email is that I think the pain with tika-bundle is due to its attempt to encompass all the parsers in one uber jar. I think breaking tika-bundle down into smaller jars (like Apache Camel does) might provide a more maintainable approach going forward for OSGi and non-OSGi developers alike.


I've put up a straw man implementation of smaller bundles here:

https://github.com/bobpaulin/tika

(specifically in https://github.com/bobpaulin/tika/tree/trunk/tika-parser-bundles)

For the impatient, please check out the code above. The approach has zero impact on any of the current tika jar files, so nothing would have to change for current users, but they could choose to move to the new approach if it better fit their needs. It's a straw man, so feel free to beat it up :).


For the rest who have time for an explanation....
The straw man implementation does the following:

1) Copies and inlines the tika-core.jar in the tika-osgi-bundle (see pom.xml).

2) The tika-osgi-bundle registers a Tika Service that contains a DefaultParser that is composed of all Tika Parsers registered as OSGi services as well as available Detectors. This extends the functionality in the tika-core TikaActivator class.

3) Copies and inlines classes from the tika-parser project into specific bundles (see individual pom.xml, specifically in the maven-bundle-plugin).

4) Each bundle registers services for each parser and provides configuration for ranking (see individual Activator.java). This could expand considerably if we wanted to provide additional properties to toggle features on and off.
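
To make steps 2 and 4 a bit more concrete, here is a rough, plain-Java sketch of the idea: parsers get registered with a ranking, and the composite picks the highest-ranked parser that supports a given media type. This is only an illustration and not the code in the straw man. SketchParser, NamedParser and ParserRegistrySketch are hypothetical names, the registry is a stand-in for the OSGi service registry, and none of it uses the real OSGi or Tika APIs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Hypothetical stand-in for a parser; NOT org.apache.tika.parser.Parser.
interface SketchParser {
    String name();
    Set<String> supportedTypes();
}

// Trivial parser implementation used only for this illustration.
final class NamedParser implements SketchParser {
    private final String name;
    private final Set<String> types;

    NamedParser(String name, String... types) {
        this.name = name;
        this.types = new HashSet<>(Arrays.asList(types));
    }

    public String name() { return name; }
    public Set<String> supportedTypes() { return types; }
}

// Hypothetical stand-in for the OSGi service registry plus the composite
// parser that is built from whatever parsers are currently registered.
final class ParserRegistrySketch {
    // One registration: a parser and the ranking its bundle configured.
    private static final class Registration {
        final SketchParser parser;
        final int ranking;

        Registration(SketchParser parser, int ranking) {
            this.parser = parser;
            this.ranking = ranking;
        }
    }

    private final List<Registration> registrations = new ArrayList<>();

    // What a parser bundle's activator would do on start (assumption).
    void register(SketchParser parser, int ranking) {
        registrations.add(new Registration(parser, ranking));
    }

    // What happens when a parser bundle is stopped (assumption).
    void unregister(SketchParser parser) {
        registrations.removeIf(r -> r.parser == parser);
    }

    // Pick the highest-ranked registered parser that supports the type.
    Optional<SketchParser> findParser(String mediaType) {
        return registrations.stream()
                .filter(r -> r.parser.supportedTypes().contains(mediaType))
                .max(Comparator.comparingInt((Registration r) -> r.ranking))
                .map(r -> r.parser);
    }

    public static void main(String[] args) {
        ParserRegistrySketch registry = new ParserRegistrySketch();
        registry.register(new NamedParser("image", "image/jpeg", "image/png"), 10);
        registry.register(new NamedParser("ocr", "image/png"), 20);

        // Both parsers support image/png; the higher ranking decides.
        System.out.println(registry.findParser("image/png")
                .map(SketchParser::name).orElse("none"));
    }
}
```

In this sketch, removing the "ocr" registration makes image/png fall back to the "image" parser, which is roughly the behavior you would want when a user leaves a parser bundle out of a deployment.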

This provides the following advantages to users and developers in both an OSGi and a standard deployment:

1) These bundles could be evolved separately. The community could drive how fine-grained the bundles are. For example, I bundled the image and jpeg packages together but put tesseract ocr in a separate bundle.

2) Users can select just the parsers they want to include in their projects.

3) Each parser project could maintain its own optional and embedded bundle dependencies. Currently the tika-bundle project has a marathon of these entries, with optional and embedded dependencies clocking in at ~200 lines and ~20 lines respectively.

If you made it this far, thanks for reading :). As an OSGi developer I think the implementation provides a better experience, but it could also make things better for non-OSGi developers. And best of all, it works without requiring changes to any of the existing released jars. Let me know what you think.



- Bob Paulin
