RE: [DISCUSS] A more modular parser project

Allison, Timothy B. Mon, 03 Aug 2015 11:57:07 -0700

Bob,
  Thank you for leading this effort.  Please continue to forgive my lack of 
OSGi knowledge in the following :).
I just built your strawman, and it looks great...I think(?)


 To confirm that I understand correctly...

  For non-OSGi folks: say I know that I'm only going to be parsing 
images.  To replicate the features of tika-app with only the image parsers, 
would I put these three jars in a bin directory (or add them via Maven to my 
project):

original-tika-app-1.10-SNAPSHOT.jar
tika-image-parser-bundle-1.10-SNAPSHOT.jar
tika-core-1.10-SNAPSHOT.jar

 and then type:
 java -cp "bin/*" org.apache.tika.cli.TikaCLI -then -the -commandline -options


 If this is the goal, why is tika-parsers "provided" in 
tika-image-parser-bundle's pom?  Aren't all dependencies of the image parsers 
packaged within that bundle?

When I actually tried this, I got a NoClassDefFoundError for Apache logging.   
Should we add that within the image-parser-bundle?  Or should I be doing 
something else (answer: very likely :) )?

Thank you, again.

           Best,

                    Tim


-----Original Message-----
From: Bob Paulin [mailto:[email protected]] 
Sent: Saturday, July 25, 2015 9:07 PM
To: [email protected]
Subject: [DISCUSS] A more modular parser project

Hi,

At ApacheCon this year there was some discussion of some woes with OSGi 
and Tika.

The TL;DR version of this email is that I think the pain with tika-bundle is 
due to its attempt to encompass all the parsers in one uber jar.  I 
think breaking down the tika-bundle into smaller jars (like Apache 
Camel) might provide a more maintainable approach going forward for OSGi 
and non-OSGi developers alike.

I've put up a straw man implementation of smaller bundles here:

https://github.com/bobpaulin/tika
(specifically in 
https://github.com/bobpaulin/tika/tree/trunk/tika-parser-bundles)

For the impatient, please check out the code above.  The approach has 
zero impact on any of the current tika jar files, so nothing would have 
to change for current users, but they could choose to move to the new 
approach if it better fits their needs.  It's a straw man, so feel free to 
beat it up :).

For the rest that have time for an explanation....
The straw man implementation does the following:

1) Copies and inlines the tika-core.jar in the tika-osgi-bundle. (see 
pom.xml)

2) The tika-osgi-bundle registers a Tika Service that contains a Default 
Parser that is composed of all Tika Parsers registered as OSGi services 
as well as available Detectors.  This extends the functionality in the 
tika-core TikaActivator class.

3) Copies and inlines classes from the tika-parsers project into specific 
bundles (see each bundle's pom.xml, specifically the maven-bundle-plugin 
configuration).

4) Each bundle registers services for each parser and provides 
configuration for ranking (see individual Activator.java).  This could 
expand considerably if we wanted to provide additional properties to 
toggle features on and off.
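
To make points 2) and 4) a bit more concrete, here is a minimal plain-Java 
sketch of the idea: each bundle registers its parsers with a rank, and a 
composite picks the highest-ranked parser that claims a given media type.  
This is only a model under stated assumptions, not the straw man's code.  
RankedParserRegistry, its methods, the parser names and the rank values are 
all hypothetical, and it uses neither the OSGi nor the Tika API.  The only 
thing borrowed from OSGi is the convention that a higher ranking is preferred.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;
import java.util.Set;

/** Plain-Java model of rank-based parser selection (hypothetical, not Tika/OSGi API). */
final class RankedParserRegistry {

    /** One registered parser: a name, the media types it claims, and its rank. */
    private static final class Entry {
        final String name;
        final Set<String> mediaTypes;
        final int rank;

        Entry(String name, Set<String> mediaTypes, int rank) {
            this.name = name;
            this.mediaTypes = mediaTypes;
            this.rank = rank;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    /** Models what a bundle's activator would do on start: one call per parser. */
    void register(String name, Set<String> mediaTypes, int rank) {
        entries.add(new Entry(name, mediaTypes, rank));
    }

    /** Models a bundle stopping: its parser drops out of the composite. */
    void unregister(String name) {
        entries.removeIf(e -> e.name.equals(name));
    }

    /** Returns the name of the highest-ranked parser claiming the media type, if any. */
    Optional<String> parserFor(String mediaType) {
        return entries.stream()
                .filter(e -> e.mediaTypes.contains(mediaType))
                .max(Comparator.comparingInt((Entry e) -> e.rank))
                .map(e -> e.name);
    }

    public static void main(String[] args) {
        RankedParserRegistry registry = new RankedParserRegistry();
        // Illustrative registrations; names and ranks are made up for this sketch.
        registry.register("image-parser", Set.of("image/png", "image/jpeg"), 0);
        registry.register("ocr-parser", Set.of("image/png"), 10);

        System.out.println(registry.parserFor("image/png"));
        System.out.println(registry.parserFor("image/jpeg"));
    }
}
```

In this model, removing a jar from the deployment corresponds to never 
calling register for its parsers, which is the behaviour point 2) of the 
advantages list relies on.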

This provides the following advantages to users and developers in both 
an OSGi and standard deployment:

1) These bundles could be evolved separately.  The community could drive 
how finely grained the bundles are.  For example, I bundled the image 
and jpeg packages together but kept Tesseract OCR separate.

2) Users can select just the parsers they want to include in their 
projects.

3) Each parser project could maintain its own optional and embedded 
bundle dependencies.  Currently the tika-bundle project has a marathon 
of these entries, with optional and embedded dependencies clocking in at 
~200 lines and ~20 lines respectively.

If you made it this far, thanks for reading :).   As an OSGi developer I 
think the implementation provides a better experience, but it could also 
make things better for non-OSGi developers. And best of all, it 
works without requiring changes to any of the existing released jars.  
Let me know what you think.


- Bob Paulin
