Responses inline.
- Bob
On 8/3/2015 6:58 PM, Allison, Timothy B. wrote:
If we actually moved the source files out of
tika-parsers and into the modules this would not be required.
Of course, got it...now. Thank you. In Tika 2.0, would we want to actually
move the source code for the bundles into the bundles, I wonder. Or, would we
still want to have one massive tika-parsers module as we do now?
+1 to moving the source to bundles. I think for 2.0 it would be easier
to consolidate into a parser uber jar than trying to tease things out
like I did in the straw man impl. However deciding how to break things
up might take some experimentation.
I added the Logging Jars to your test and I discovered 2 things.
1) To spin up the GUI you need org.apache.tika.parser.util (perhaps
consider moving this up to core).
2) Since the META-INF/services/org.apache.tika.parser.Parser is in
tika-parsers, we'd need to rethink the static ServiceLoader strategy to
either always be dynamic or figure out a way to have each jar bring
their own static loader.
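As a rough illustration of the "each jar brings its own loader" option: java.util.ServiceLoader already merges every META-INF/services/<interface-name> file it finds on the class path, so each parser jar could ship its own provider file listing only its own parsers. The sketch below uses a hypothetical stand-in interface (ParserDiscovery.Parser), not Tika's real org.apache.tika.parser.Parser, and whether this fits Tika's existing static loader is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

public class ParserDiscovery {

    /** Hypothetical stand-in for a parser service interface (not Tika's API). */
    public interface Parser {
        String name();
    }

    /**
     * Collects the names of all Parser providers visible to the given class loader.
     * ServiceLoader reads every META-INF/services/ParserDiscovery$Parser file it
     * can find, so providers contributed by separate jars are combined here.
     */
    public static List<String> discover(ClassLoader classLoader) {
        List<String> names = new ArrayList<>();
        for (Parser parser : ServiceLoader.load(Parser.class, classLoader)) {
            names.add(parser.name());
        }
        return names;
    }

    public static void main(String[] args) {
        // With no provider files on the class path the list is simply empty.
        System.out.println(discover(ParserDiscovery.class.getClassLoader()));
    }
}
```

Under this arrangement, adding or removing a parser jar changes the discovered set without touching a central services file.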
Can you provide the source for your test?
Source?! Ha. No, I literally/manually copied the jars into bin... the horror!
I'm sure maven would have added the transitive dependencies. I'll give that a
try. Thank you, again.
You crack me up :).
Best,
Tim
On Mon, Aug 3, 2015 at 1:56 PM, Allison, Timothy B. <[email protected]>
wrote:
Bob,
Thank you for leading this effort. Please continue to forgive my lack
of OSGi knowledge in the following :).
I just built your strawman, and it looks great...I think(?)
To confirm that I understand correctly...
For non-OSGi folks, let's say I know that I'm only going to be parsing
images, to replicate the features of tika-app but with only the image
parsers, would I put these three jars in a bin directory (or add them via
maven to my project):
original-tika-app-1.10-SNAPSHOT.jar
tika-image-parser-bundle-1.10-SNAPSHOT.jar
tika-core-1.10-SNAPSHOT.jar
and then type:
java -cp "bin/*" org.apache.tika.cli.TikaCLI -then -the -commandline -options
If this is the goal, why is tika-parsers "provided" in
tika-image-parser-bundle's pom? Aren't all dependencies of the image
parsers packaged within that bundle?
When I actually tried this, I got a NoClassDefFoundError for Apache logging.
Should we add that within the image-parser-bundle? Or should I be doing
something else (answer= very likely :) ).
Thank you, again.
Best,
Tim
-----Original Message-----
From: Bob Paulin [mailto:[email protected]]
Sent: Saturday, July 25, 2015 9:07 PM
To: [email protected]
Subject: [DISCUSS] A more modular parser project
Hi,
At ApacheCon this year there was some discussion of some woes with OSGi
and Tika.
The TL;DR version of this email is I think the pain with tika-bundle is
due to its attempt to encompass all the parsers in one Uber jar. I
think breaking down the tika-bundle into smaller jars (like Apache
Camel) might provide a more maintainable approach going forward for OSGi
and non-OSGi developers alike.
I've put up a straw man implementation of smaller bundles here:
https://github.com/bobpaulin/tika
(specifically in
https://github.com/bobpaulin/tika/tree/trunk/tika-parser-bundles)
For the impatient please check out the code above. The approach has
zero impact on any of the current tika jar files so nothing would have
to change for current users but they could choose to move to the new
approach if it better fit their needs. It's a straw man so feel free to
beat it up :).
For the rest that have time for an explanation....
The straw man implementation does the following:
1) Copies and inlines the tika-core.jar in the tika-osgi-bundle. (see
pom.xml)
2) The tika-osgi-bundle registers a Tika Service that contains a Default
Parser that is composed of all Tika Parsers registered as OSGi services
as well as available Detectors. This extends the functionality in the
tika-core TikaActivator class.
3) Copies and inlines classes from the tika-parsers project into specific
bundles (see individual pom.xml specifically in the maven-bundle-plugin).
4) Each bundle registers services for each parser and provides
configuration for ranking (see individual Activator.java). This could
expand considerably if we wanted to provide additional properties to
toggle features on and off.
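To make points 2 and 4 more concrete, here is a minimal plain-Java sketch of the idea: parsers are registered with a ranking, and a composite lookup picks the highest-ranked parser that supports a media type. It is only a stand-in for the OSGi service registry (where a higher service.ranking is preferred); the names RankedParserRegistry, SimpleParser and fixed are hypothetical and are not taken from the straw man code or from Tika.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

public class RankedParserRegistry {

    /** Hypothetical, simplified parser contract (not Tika's Parser interface). */
    public interface SimpleParser {
        boolean supports(String mediaType);
        String parse(String input);
    }

    /** Pairs a parser with the ranking it was registered under. */
    private static final class Entry {
        final SimpleParser parser;
        final int ranking;

        Entry(SimpleParser parser, int ranking) {
            this.parser = parser;
            this.ranking = ranking;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    /** Registers a parser; higher rankings are consulted first. */
    public void register(SimpleParser parser, int ranking) {
        entries.add(new Entry(parser, ranking));
        // List.sort is stable, so equal rankings keep registration order.
        entries.sort(Comparator.comparingInt((Entry e) -> e.ranking).reversed());
    }

    /** Removes a parser, e.g. when the bundle that provided it is stopped. */
    public void unregister(SimpleParser parser) {
        entries.removeIf(e -> e.parser == parser);
    }

    /** Returns the highest-ranked registered parser that supports the type. */
    public Optional<SimpleParser> parserFor(String mediaType) {
        return entries.stream()
                .map(e -> e.parser)
                .filter(p -> p.supports(mediaType))
                .findFirst();
    }

    /** Helper for demos: a parser for one media type that tags its output. */
    public static SimpleParser fixed(String mediaType, String label) {
        return new SimpleParser() {
            @Override
            public boolean supports(String type) {
                return mediaType.equals(type);
            }

            @Override
            public String parse(String input) {
                return label + ":" + input;
            }
        };
    }
}
```

In a real bundle the Activator would presumably hand this job to the OSGi framework by registering each parser as a service with a ranking property; the registry above only mimics the selection behaviour.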
This provides the following advantages to users and developers in both
an OSGi and standard deployment:
1) These bundles could be evolved separately. The community could drive
how finely grained the bundles were. For example I bundled the image
and jpeg packages together but put tesseract ocr separate.
2) Users can just select the parsers they want to include in their
projects.
3) Each parser project could maintain its own optional and embedded
bundle dependencies. Currently the tika-bundle project has a marathon
of these entries with optional and embedded dependencies clocking in at
~200 lines and ~20 lines respectively.
If you made it this far thanks for reading :). As an OSGi developer I
think the implementation provides a better experience but could also
make things better for non-OSGi developers. And best of all, it
works without requiring changes to any of the existing released jars.
Let me know what you think.
- Bob Paulin