On Jun 19, 2014, at 2:53am, Sergey Beryozkin <[email protected]> wrote:

> Hi
> On 19/06/14 01:58, Ray Gauss wrote:
>> The point of a tika-parsers-all artifact would be a single dependency that 
>> re-aggregates everything so that downstream projects could work the same way 
>> they do now and not worry about missing dependencies.
>> 
>> Meanwhile people that just want PDF parsing could declare only the 
>> tika-parser-pdf dependency.
>> 
>> We could go the other way, focusing on exclusions, but as we add more 
>> parsers for different types those downstream projects will have to keep 
>> updating those exclusion lists.
>> 
>> What’s the disadvantage of splitting things up (in a 2.0 timeframe)?
>> 
> 
> From what I understand the concern is the proliferation of many new micro 
> modules.
> 
> I wonder whether tika-parsers contains only the specific parser 
> implementations plus some related support modules, with nothing extra. If 
> so, then effectively it already is 'tika-parsers-all'.
> 
> If that is the case, then I'd settle for documenting the individual 
> dependencies that support specific file extensions/media types.

That's essentially what I was wondering about when I asked Nick:

>> I'm curious - assuming I only want to parse HTML and PDF (as an example), 
>> then what's the right way to ask Maven nicely for what I need to include?

The current approach seems (still) to be:

> * Use tika-app --list-parser-details to find out which class handles
>   the mimetype you want
> * Grep the tika parsers source tree for that class's package, and get
>   the list of imports it makes
> * Explicitly list the artifacts that provide the imports you saw

Unfortunately this is error-prone. There's no real way to know for sure that 
you have all the required dependent jars.
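
One partial safeguard, offered only as a rough sketch and not a guarantee: once 
the classpath is assembled, try to load the classes that the chosen parsers 
import and report any that cannot be found. The class name ClasspathCheck, the 
method findMissing, and the two sample class names (PDFBox and TagSoup classes 
that I assume the PDF and HTML parsers import) are illustrative; substitute 
whatever imports you actually found in the source tree.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ClasspathCheck {

    // Returns the names from classNames that cannot be loaded from the classpath.
    public static List<String> findMissing(List<String> classNames) {
        List<String> missing = new ArrayList<>();
        ClassLoader loader = ClasspathCheck.class.getClassLoader();
        for (String name : classNames) {
            try {
                // initialize=false: locate the class without running static initializers
                Class.forName(name, false, loader);
            } catch (ClassNotFoundException | LinkageError e) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Sample names only (assumed imports); pass your own as arguments.
        List<String> required = args.length > 0
                ? Arrays.asList(args)
                : Arrays.asList(
                        "org.apache.pdfbox.pdmodel.PDDocument",
                        "org.ccil.cowan.tagsoup.Parser");
        List<String> missing = findMissing(required);
        if (missing.isEmpty()) {
            System.out.println("All listed classes were found.");
        } else {
            System.out.println("Missing from classpath: " + missing);
        }
    }
}
```

This only checks the classes you name. A parser can still fail later on a 
class that one of those libraries loads lazily, so it narrows the problem 
without removing it.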

My approach has been to use Maven to build the dependency graph, then whack the 
biggest unneeded transitive jars to reduce the footprint of our Hadoop job jar.

-- Ken


>> On June 18, 2014 at 11:39:00 AM, Nick Burch ([email protected]) wrote:
>>> On Wed, 18 Jun 2014, Ray Gauss wrote:
>>>> I think for 2.0 we should consider splitting out parsers into their own
>>>> projects for a streamlined dependency hierarchy then reassembling them
>>>> with something like a tika-parsers-all artifact.
>>> 
>>> We had another thread on that not that long ago, where someone cautioned
>>> against breaking it up into too many pieces. We also have fairly frequent
>>> posts on the users list from people who aren't getting any content
>>> returned, because they've forgotten to include a dependency on
>>> tika-parsers
>>> 
>>> I'm not convinced that splitting tika parsers into 20 odd dependencies is
>>> really going to help more than it hinders - more people will get confused
>>> by missing dependencies they really wanted, and anyone with special needs
>>> about what does/doesn't get parsed is probably going to be taking such
>>> care that they can just exclude everything by default anyway and just pull
>>> in what they need. I'd probably rather we just gave an example pom snippet
>>> that shows how to exclude all except one thing, and let people with
>>> special cases work from there.
>>> 
>>> Nick


--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr




