Hey Tim,

This is a really good question about the provided scope.  In Maven I need to
have some sort of dependency on tika-parsers so that I can inline some of
the classes from its jar.  If we actually moved the source files out of
tika-parsers and into the modules, this would not be required.


In the OSGi environment I provided my own logging in the runtime, but in a
standard deployment (like what you're doing with tika-app) I would assume
that logging would be pulled in as a transitive dependency.  If this was
happening automatically before, perhaps there's something I need to add to
the image parser bundle.  Can you provide the source for your test?
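
One quick way to narrow this down is to check which classes the JVM can
actually see on the classpath you launched with.  The following is only a
minimal sketch: the class name ClasspathCheck is made up for illustration,
and the default candidate names are guesses (commons-logging's LogFactory
as one possible reading of "Apache logging", plus Tika from tika-core as a
sanity check), so pass the class named in your stack trace instead.

```java
public class ClasspathCheck {

    /** Returns true if the named class can be loaded from the current classpath. */
    public static boolean isLoadable(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Class is absent, or present but one of its own dependencies is missing
            return false;
        }
    }

    public static void main(String[] args) {
        // ASSUMPTION: these defaults are guesses; pass the class from the stack trace as an argument
        String[] candidates = args.length > 0 ? args : new String[] {
            "org.apache.commons.logging.LogFactory",
            "org.apache.tika.Tika"
        };
        for (String name : candidates) {
            System.out.println(name + " -> " + (isLoadable(name) ? "found" : "MISSING"));
        }
    }
}
```

Run it with the same -cp "bin/*" setting used for TikaCLI; any class
reported as MISSING is not being supplied by the jars in that directory.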

- Bob

On Mon, Aug 3, 2015 at 1:56 PM, Allison, Timothy B. <[email protected]>
wrote:

> Bob,
>   Thank you for leading this effort.  Please continue to forgive my lack
> of OSGi knowledge in the following :).
> I just built your strawman, and it looks great...I think(?)
>
>  To confirm that I understand correctly...
>
>   For non-OSGi folks, let's say I know that I'm only going to be parsing
> images, to replicate the features of tika-app but with only the image
> parsers, would I put these three jars in a bin directory (or add them via
> maven to my project):
>
> original-tika-app-1.10-SNAPSHOT.jar
> tika-image-parser-bundle-1.10-SNAPSHOT.jar
> tika-core-1.10-SNAPSHOT.jar
>
>  and then type:
>  java -cp "bin/*" org.apache.tika.cli.TikaCLI -then -the -commandline
> -options
>
>
>  If this is the goal, why is tika-parsers "provided" in
> tika-image-parser-bundle's pom?  Aren't all dependencies of the image
> parsers packaged within that bundle?
>
> When I actually tried this, I got a NoClassDefFoundError for Apache logging.
>  Should we add that within the image-parser-bundle?  Or should I be doing
> something else (answer= very likely :) ).
>
> Thank you, again.
>
>            Best,
>
>                     Tim
>
>
> -----Original Message-----
> From: Bob Paulin [mailto:[email protected]]
> Sent: Saturday, July 25, 2015 9:07 PM
> To: [email protected]
> Subject: [DISCUSS] A more modular parser project
>
> Hi,
>
> At ApacheCon this year there was some discussion of some woes with OSGi
> and Tika.
>
> The TL;DR version of this email is I think the pain with tika-bundle is
> due to its attempt to encompass all the parsers in one uber jar.  I
> think breaking down the tika-bundle into smaller jars (like Apache
> Camel) might provide a more maintainable approach going forward for OSGi
> and non-OSGi developers alike.
>
> I've put up a straw man implementation of smaller bundles here:
>
> https://github.com/bobpaulin/tika
> (specifically in
> https://github.com/bobpaulin/tika/tree/trunk/tika-parser-bundles)
>
> For the impatient, please check out the code above.  The approach has
> zero impact on any of the current tika jar files, so nothing would have
> to change for current users, but they could choose to move to the new
> approach if it better fit their needs.  It's a straw man so feel free to
> beat it up :).
>
> For the rest that have time for an explanation....
> The straw man implementation does the following:
>
> 1) Copies and inlines the tika-core.jar in the tika-osgi-bundle. (see
> pom.xml)
>
> 2) The tika-osgi-bundle registers a Tika Service that contains a Default
> Parser that is composed of all Tika Parsers registered as OSGi services
> as well as available Detectors.  This extends the functionality in the
> tika-core TikaActivator class.
>
> 3) Copies and inlines classes from the tika-parsers project into specific
> bundles (see individual pom.xml specifically in the maven-bundle-plugin).
>
> 4) Each bundle registers services for each parser and provides
> configuration for ranking (see individual Activator.java).  This could
> expand considerably if we wanted to provide additional properties to
> toggle features on and off.
>
> This provides the following advantages to users and developers in both
> an OSGi and standard deployment:
>
> 1) These bundles could be evolved separately.  The community could drive
> how finely grained the bundles were.  For example I bundled the image
> and jpeg packages together but put tesseract ocr separate.
>
> 2) Users can just select the parsers they want to include in their
> projects.
>
> 3) Each parser project could maintain its own optional and embedded
> bundle dependencies.  Currently the tika-bundle project has a marathon
> of these entries with optional and embedded dependencies clocking in at
> ~200 lines and ~20 lines respectively.
>
> If you made it this far thanks for reading :).   As an OSGi developer I
> think the implementation provides a better experience but could also
> make things better for non-OSGi developers as well. And best of all it
> works without requiring changes to any of the existing released jars.
> Let me know what you think.
>
>
> - Bob Paulin
>

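The ranking idea in step 4 of the quoted proposal (each bundle registers
its parsers with a ranking, and a composite picks among them) can be
sketched in plain Java without any OSGi dependency.  This is an
illustration only, not the code in the straw man: RankedParserRegistry,
SimpleParser, the parser names, and the ranking values are all
hypothetical, and a real bundle would register actual Tika parsers
through the OSGi service registry instead.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

public class RankedParserRegistry {

    /** Hypothetical stand-in for a parser service; NOT Tika's Parser interface. */
    public interface SimpleParser {
        String name();
        boolean supports(String mediaType);
    }

    /** A registered parser together with the ranking supplied at registration time. */
    private static final class Entry {
        final SimpleParser parser;
        final int ranking;

        Entry(SimpleParser parser, int ranking) {
            this.parser = parser;
            this.ranking = ranking;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    /** Mimics a bundle activator registering one parser with a configured ranking. */
    public void register(SimpleParser parser, int ranking) {
        entries.add(new Entry(parser, ranking));
    }

    /** Picks the highest-ranked registered parser that supports the media type, if any. */
    public Optional<SimpleParser> select(String mediaType) {
        return entries.stream()
                .filter(e -> e.parser.supports(mediaType))
                .max(Comparator.comparingInt(e -> e.ranking))
                .map(e -> e.parser);
    }

    /** Helper that builds a toy parser matching media types by prefix. */
    public static SimpleParser parser(String name, String mediaTypePrefix) {
        return new SimpleParser() {
            @Override public String name() { return name; }
            @Override public boolean supports(String mediaType) {
                return mediaType.startsWith(mediaTypePrefix);
            }
        };
    }

    public static void main(String[] args) {
        RankedParserRegistry registry = new RankedParserRegistry();
        // Hypothetical names and rankings, chosen only for illustration
        registry.register(parser("image-parser", "image/"), 10);
        registry.register(parser("ocr-parser", "image/"), 20);

        String chosen = registry.select("image/png")
                .map(SimpleParser::name)
                .orElse("none");
        System.out.println("image/png handled by: " + chosen);
    }
}
```

Only the parsers that were actually registered take part in the
selection, which is the same property the smaller bundles aim for: leave
a bundle out and its parsers are simply never registered.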