Hey Nick,

This sounds like a great plan to me; good job to you and Sergey. As for helping, I'll try my best, but I'm not an OSGi guru :)

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Nick Burch <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Sunday, November 23, 2014 at 6:12 PM
To: "[email protected]" <[email protected]>
Subject: Subsets of tika parsers redux

>Hi All
>
>During ApacheCon, I had a chance to chat with Sergey about the "subset of
>Tika Parsers" issue that bubbles up from time to time. It seemed to work
>well, and I think we both now have a better idea of the other's needs and
>concerns, which is good :)
>
>As is shown on our list from time to time, but more commonly elsewhere, we
>have some users who are confused already by the split between tika-core
>and tika-parsers. Anything that fragments further is going to cause more
>issues for that kind of user.
>
>On the other hand, there are potential users out there who want just a
>handful of parsers, in a simple and easy and small way, who don't know a
>lot about Tika yet, and aren't perhaps Maven or OSGi gurus. Many of those
>are using OSGi, but not all.
>
>One suggested solution is to just document what dependencies of
>tika-parsers can be excluded at the maven level to disable certain parsers
>+ shrink the resulting dependency tree. However, that requires manual
>updates, manual checking, and like our examples on the website risk
>getting out of date without automated checking.
>
>Discussion then turned to our move to get all the examples for the website
>into svn, with unit tests, and having the website pull those from svn on
>the fly to always get the latest tested version.
>
>
>That led to an idea. Not sure if it'll work yet, but...
>
>What about having multiple Tika OSGi bundles? Continue with the "full"
>bundle as now, but also have ones for "pdf", "microsoft office", "images"
>etc. OSGi users (eg CXF users) could then opt to depend on pdf+image if
>they only wanted a handful of parsers, or the full one as now.
>
>The smart bit - we have unit tests for these smaller bundles. These unit
>tests ensure that the desired parsers still work on their smaller bundle.
>These unit tests also ensure that unwanted parsers don't work, thus
>flagging up if extra dependencies have snuck through.
>
>Finally, we pull out the includes/excludes information that went into the
>bundle, and display that for non-OSGi users. A non-OSGi person wanting
>"tika with pdf only" could then look at what the tika-pdf-bundle does and
>doesn't use, and from that know what maven level dependencies to keep and
>which to exclude.
>
>
>This new plan would mean having to tweak our build to support multiple
>bundles, and potentially tweaking our bundles so that you could load
>tika-pdf + tika-image and have those two play nicely together. It'd also
>need some new unit tests, and the work to figure out what to
>include/exclude for each of our handful of "common" cases. It should,
>however, deliver a way for OSGi and non-OSGi people to get just a subset
>if that's all they want.
>
>Can anyone see a flaw with this plan? Anyone see a better way? Anyone want
>to help? :)
>
>Nick
