Hey Tim,

This is a really good question on the provided scope. In Maven I need to have some sort of dependency on tika-parsers so that I can inline some of the classes from that jar. If we actually moved the source files out of tika-parsers and into the modules, this would not be required.

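For illustration, the arrangement described above could look roughly like the following pom.xml fragment. This is only a sketch under assumptions, not the actual strawman pom: the package paths in the inline pattern are guesses based on the image and jpeg parser packages mentioned in this thread, and the real bundle may use different instructions.

```xml
<!-- Hypothetical sketch of a parser bundle pom.xml; not copied from the strawman. -->
<dependencies>
  <dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.10-SNAPSHOT</version>
    <!-- Present at build time so its classes can be inlined; not passed on to consumers. -->
    <scope>provided</scope>
  </dependency>
</dependencies>

<build>
  <plugins>
    <plugin>
      <groupId>org.apache.felix</groupId>
      <artifactId>maven-bundle-plugin</artifactId>
      <extensions>true</extensions>
      <configuration>
        <instructions>
          <!-- Assumed package paths: unpack only the image and jpeg parser classes into the bundle. -->
          <Embed-Dependency>tika-parsers;scope=provided;inline=org/apache/tika/parser/image/**|org/apache/tika/parser/jpeg/**</Embed-Dependency>
        </instructions>
      </configuration>
    </plugin>
  </plugins>
</build>
```

One consequence of this layout: Maven does not propagate provided-scope dependencies transitively, so anything tika-parsers would normally bring along (logging included) has to be declared by the bundle itself or supplied by the runtime.
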
In the OSGi environment I provided my own logging in the runtime, but in a standard deployment (like what you're doing with tika-app) I would assume that logging would be pulled in as a transitive dependency. If this was happening automatically before, perhaps there's something I need to add to the image parser. Can you provide the source for your test?

- Bob

On Mon, Aug 3, 2015 at 1:56 PM, Allison, Timothy B. <[email protected]> wrote:

> Bob,
> Thank you for leading this effort. Please continue to forgive my lack
> of OSGi knowledge in the following :).
> I just built your strawman, and it looks great...I think(?)
>
> To confirm that I understand correctly...
>
> For non-OSGi folks, let's say I know that I'm only going to be parsing
> images, to replicate the features of tika-app but with only the image
> parsers, would I put these three jars in a bin directory (or add them via
> maven to my project):
>
> original-tika-app-1.10-SNAPSHOT.jar
> tika-image-parser-bundle-1.10-SNAPSHOT.jar
> tika-core-1.10-SNAPSHOT.jar
>
> and then type:
> java -cp "bin/*" org.apache.tika.cli.TikaCLI -then -the -commandline -options
>
> If this is the goal, why is tika-parsers "provided" in
> tika-image-parser-bundle's pom? Aren't all dependencies of the image
> parsers packaged within that bundle?
>
> When I actually tried this, I got a NoClassDefFoundError for Apache logging.
> Should we add that within the image-parser-bundle? Or should I be doing
> something else (answer= very likely :) ).
>
> Thank you, again.
>
> Best,
>
> Tim
>
> -----Original Message-----
> From: Bob Paulin [mailto:[email protected]]
> Sent: Saturday, July 25, 2015 9:07 PM
> To: [email protected]
> Subject: [DISCUSS] A more modular parser project
>
> Hi,
>
> At Apache Con this year there was some discussion of some woes with OSGi
> and Tika.
>
> The TL;DR version of this email is I think the pain with tika-bundle is
> due to its attempt to encompass all the parsers in one Uber jar.
> I think breaking down the tika-bundle into smaller jars (like Apache
> Camel) might provide a more maintainable approach going forward for OSGi
> and non-OSGi developers alike.
>
> I've put up a straw man implementation of smaller bundles here:
>
> https://github.com/bobpaulin/tika
> (specifically in
> https://github.com/bobpaulin/tika/tree/trunk/tika-parser-bundles)
>
> For the impatient please check out the code above. The approach has
> zero impact on any of the current tika jar files so nothing would have
> to change for current users but they could choose to move to the new
> approach if it better fit their needs. It's a straw man so feel free to
> beat it up :).
>
> For the rest that have time for an explanation....
> The straw man implementation does the following:
>
> 1) Copies and inlines the tika-core.jar in the tika-osgi-bundle. (see
> pom.xml)
>
> 2) The tika-osgi-bundle registers a Tika Service that contains a Default
> Parser that is composed of all Tika Parsers registered as OSGi services
> as well as available Detectors. This extends the functionality in the
> tika-core TikaActivator class.
>
> 3) Copies and inlines classes from the tika-parser project into specific
> bundles (see individual pom.xml specifically in the maven-bundle-plugin).
>
> 4) Each bundle registers services for each parser and provides
> configuration for ranking (see individual Activator.java). This could
> expand considerably if we wanted to provide additional properties to
> toggle features on and off.
>
> This provides the following advantages to users and developers in both
> an OSGi and standard deployment:
>
> 1) These bundles could be evolved separately. The community could drive
> how finely grained the bundles were. For example I bundled the image
> and jpeg packages together but put tesseract ocr separate.
>
> 2) Users can just select the parsers they want to include in their
> projects.
>
> 3) Each parser project could maintain its own optional and embedded
> bundle dependencies. Currently the tika-bundle project has a marathon
> of these entries with optional and embedded dependencies clocking in at
> ~200 lines and ~20 lines respectively.
>
> If you made it this far thanks for reading :). As an OSGi developer I
> think the implementation provides a better experience but could also
> make things better for non-OSGi developers as well. And best of all it
> works without requiring changes to any of the existing released jars.
> Let me know what you think.
>
>
> - Bob Paulin
