Re: Subsets of tika parsers redux

Mattmann, Chris A (3980) Mon, 24 Nov 2014 07:57:46 -0800

Hey Nick,

This sounds like a great plan to me, good job to you
and Sergey. As for helping I'll try my best, but I'm not
an OSGI guru :)


Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++






-----Original Message-----
From: Nick Burch <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Sunday, November 23, 2014 at 6:12 PM
To: "[email protected]" <[email protected]>
Subject: Subsets of tika parsers redux

>Hi All
>
>During ApacheCon, I had a chance to chat with Sergey about the "subset of
>Tika Parsers" issue that bubbles up from time to time. It seemed to work
>well, and I think we both now have a better idea of the other's needs and
>concerns, which is good :)
>
>As is shown on our list from time to time, but more commonly elsewhere,
>we have some users who are already confused by the split between
>tika-core and tika-parsers. Anything that fragments further is going to
>cause more issues for that kind of user.
>
>On the other hand, there are potential users out there who want just a
>handful of parsers, in a simple and easy and small way, who don't know a
>lot about Tika yet, and aren't perhaps Maven or OSGi gurus. Many of those
>are using OSGi, but not all.
>
>One suggested solution is to just document which dependencies of
>tika-parsers can be excluded at the Maven level to disable certain
>parsers and shrink the resulting dependency tree. However, that requires
>manual updates and manual checking and, like our examples on the website,
>risks getting out of date without automated checking.
>
>Discussion then turned to our move to get all the examples for the
>website into svn, with unit tests, and having the website pull those from
>svn on the fly to always get the latest tested version.
>
>
>That led to an idea. Not sure if it'll work yet, but...
>
>What about having multiple Tika OSGi bundles? Continue with the "full"
>bundle as now, but also have ones for "pdf", "microsoft office", "images"
>etc. OSGi users (eg CXF users) could then opt to depend on pdf+image if
>they only wanted a handful of parsers, or the full one as now.
>
>The smart bit - we have unit tests for these smaller bundles. These unit
>tests ensure that the desired parsers still work on their smaller bundle.
>These unit tests also ensure that unwanted parsers don't work, thus
>flagging up if extra dependencies have snuck through.
>
>Finally, we pull out the includes/excludes information that went into the
>bundle, and display that for non-OSGi users. A non-OSGi person wanting
>"tika with pdf only" could then look at what the tika-pdf-bundle does and
>doesn't use, and from that know what maven level dependencies to keep and
>which to exclude.
>
>
>This new plan would mean having to tweak our build to support multiple
>bundles, and potentially tweaking our bundles so that you could load
>tika-pdf + tika-image and have those two play nicely together. It'd also
>need some new unit tests, and the work to figure out what to
>include/exclude for each of our handful of "common" cases. It should,
>however, deliver a way for OSGi and non-OSGi people to get just a subset
>if that's all they want.
>
>Can anyone see a flaw with this plan? Anyone see a better way? Anyone
>want to help? :)
>
>Nick
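
For anyone picking this up, the two-sided check Nick describes (wanted
parsers still work, unwanted ones don't) could be sketched roughly as below.
This is only an illustration, not code from the Tika build: the class name
BundleSubsetCheck, the violations method and the example type sets are all
hypothetical, and plain MIME type strings stand in for whatever a real test
would collect from the bundle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of Tika: checks one "subset" bundle.
public class BundleSubsetCheck {

    /**
     * Compares the MIME types a bundle actually supports against the types
     * it must handle (wanted) and the types it must not handle (unwanted).
     * Returns one message per problem; an empty list means the bundle is clean.
     */
    public static List<String> violations(Set<String> supported,
                                          Set<String> wanted,
                                          Set<String> unwanted) {
        List<String> problems = new ArrayList<>();
        // TreeSet gives a stable, sorted order for the report.
        for (String type : new TreeSet<>(wanted)) {
            if (!supported.contains(type)) {
                problems.add("missing: " + type);
            }
        }
        for (String type : new TreeSet<>(unwanted)) {
            if (supported.contains(type)) {
                // An extra dependency has leaked into the smaller bundle.
                problems.add("leaked: " + type);
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        // Assumed example data: a "pdf" bundle that accidentally
        // still handles Word documents.
        Set<String> pdfBundle = Set.of("application/pdf", "application/msword");
        List<String> problems = violations(
                pdfBundle,
                Set.of("application/pdf"),
                Set.of("application/msword", "image/png"));
        System.out.println(problems);
    }
}
```

In a real test the supported set would presumably be gathered from the
parsers the bundle exposes (for example via Parser.getSupportedTypes), and
the test would fail whenever the returned list is non-empty.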
