Hey Nick,

Thanks for the thoughts. Just to clear a few things up: the version of the app on my GitHub already includes all the parsers that the current app does. If you build it and run --list-parsers you'll see them there. As for the desire to quickly test new bits, I think much of the OSGi stuff has been abstracted away. For an example, see the example folder [1]. The only additions are the Activator class (which is identical for all the current bundles) and the maven-bundle-plugin in the pom.xml. But don't take my word for it; why not give it a spin?

As for the use cases, consider that whenever we upgrade or add parsers/detectors/encodingdetectors/languagedetectors we may introduce new dependencies or new versions of existing ones. For example, the pom for the tika-app currently pulls in 3 different versions of commons-io, 2 versions of commons-codec, and 2 versions of Guava. Maven resolves each to just one version in the final build, but the effect is that every part of the code must work with the selected version. In the OSGi version of tika-app the modules can have different versions of the dependencies within the same app. Also, within TIKA-1285 [2] it could have been possible to support 2 different versions of PDFBox within different OSGi bundles. So I see it as more of a gain, but I'd be interested in hearing if there is any degradation in the development experience.
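
To make the version point concrete, here is a rough sketch in plain Java. It is not real Maven or OSGi code, only a toy model of the two outcomes. The module names (parser-a, parser-b) and the version numbers are made up for illustration, and "first declaration wins" is a simplification of how Maven actually mediates conflicting versions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DependencyResolutionSketch {

    /**
     * Flat classpath model: the whole app ends up with one version per library.
     * Here the first module to declare a library wins, which is a
     * simplification of Maven's dependency mediation.
     */
    static Map<String, String> flatClasspath(Map<String, Map<String, String>> modules) {
        Map<String, String> resolved = new LinkedHashMap<>();
        for (Map<String, String> deps : modules.values()) {
            deps.forEach(resolved::putIfAbsent);
        }
        return resolved;
    }

    /**
     * Bundle-style model: every module keeps exactly the versions it declared,
     * loosely mirroring how each OSGi bundle gets its own class loader.
     */
    static Map<String, Map<String, String>> perBundle(Map<String, Map<String, String>> modules) {
        Map<String, Map<String, String>> resolved = new LinkedHashMap<>();
        modules.forEach((name, deps) -> resolved.put(name, new LinkedHashMap<>(deps)));
        return resolved;
    }

    public static void main(String[] args) {
        // Hypothetical modules and versions, for illustration only.
        Map<String, Map<String, String>> modules = new LinkedHashMap<>();
        modules.put("parser-a", Map.of("commons-io", "2.4"));
        modules.put("parser-b", Map.of("commons-io", "2.5"));

        System.out.println("flat classpath: " + flatClasspath(modules));
        System.out.println("per bundle:     " + perBundle(modules));
    }
}
```

In the flat model parser-b has to run against whatever version parser-a pulled in, while in the per-bundle model each parser sees the version it was built against.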


- Bob


[1] https://github.com/bobpaulin/tika-app-osgi/tree/master/examples/dummy-parser-bundle

[2] https://issues.apache.org/jira/browse/TIKA-1285


On 9/13/2016 3:38 PM, Nick Burch wrote:
On Sun, 11 Sep 2016, Bob Paulin wrote:
I'd like to propose a new Tika App for the 2.0 branch. One of the reasons we broke apart the Tika parsers into modules was due to the complexity of having to deal with all the parser dependencies and transitive dependencies. Now developers can use just the modules they want without pulling the kitchen sink with it. Unfortunately this approach doesn't simplify the problem in the tika-parser or tika-app project where the whole kitchen sink comes together again.

One of the nice things about the tika app (and server) is you do get everything, so it's very easy to test and get started with!

Another nice thing is that you can test small changes (eg a new parser or a new mime type) quite quickly, just by using the tika app jar on your classpath along with your customisation. Makes it very easy to try out new things if you're a new developer, and I find it's usually easier than firing up Eclipse if I just want to try a new mime type change for someone.


More modular versions of the Tika server I could certainly get behind, if we haven't already done so!

For the app, are there that many use cases for it where you might only want some of Tika? (Most people calling Tika from another language would likely be better off with the server, to avoid the JVM start/stop overhead).

Would the new OSGi version make it harder for people to test new bits with tika? For one example, whenever we've done a hackathon and are helping people with a new parser, helping them get their new parser used with just the app is about do-able. I fear if we made them also learn OSGi + build a bundle, at that stage when they're trying to do a "hello world", we'd lose them :/

The GitHub project does look interesting though! I'd hate for us to get a few shiny new bits, but lose some key bits important for newbies / quick-win developers in the process...

Nick

