Hi Nick, All,

I've revisited this subject recently, and I have to admit the current situation is not ideal.
New parsers are added every two weeks or so, and having downstream tika-parsers consumers keep excluding every parser dependency they do not want (the list can change dynamically - well, it's not that dynamic :-) but you see what I mean) can become a problem.

How about this approach:

Introduce a tika-parsers-optional module (pom.xml only). It would depend on tika-parsers but exclude all the specific parser lib dependencies, so that they effectively become optional. This pom.xml would only have
a single dependency:

<dependency>
  <artifactId>tika-parsers</artifactId>
  <exclusions>
    <!-- exclude specific parser libs -->
  </exclusions>
</dependency>

Users who do not want to spend time excluding each and every parser lib dependency they do not need would use tika-parsers-optional, look at the Tika documentation, and add only the specific dependencies they need.
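
For illustration, a downstream pom.xml under this proposal might look roughly like the sketch below. Note that tika-parsers-optional is only the artifact proposed here and does not exist. The tika.version and pdfbox.version properties, the pom packaging type, and the choice of PDFBox as the one library a PDF-only application adds back are all assumptions made for the example.

```xml
<dependencies>
  <!-- Hypothetical artifact from this proposal: tika-parsers minus the parser libs -->
  <dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers-optional</artifactId>
    <version>${tika.version}</version> <!-- assumed property -->
    <type>pom</type> <!-- assumes the module is published with pom packaging -->
  </dependency>
  <!-- Add back only the parser lib this application needs (PDF here) -->
  <dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>${pdfbox.version}</version> <!-- assumed property -->
  </dependency>
</dependencies>
```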

To be honest, this seems to be a rather messy approach. Having tika-parsers itself declare the parser lib dependencies as optional, and getting users to add the libs they actually need (again, after looking at the documentation), would be better. That is not that destabilizing, to be honest: any practical application is expected to be aware of the actual file formats it handles and of the parser libs supporting those formats.
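
As a rough sketch of that alternative, a parser lib entry inside tika-parsers' own pom.xml could be marked optional as shown below. POI is used only as an example library, and the poi.version property is an assumption, not a quote from the actual pom.

```xml
<!-- Sketch of one entry in tika-parsers/pom.xml: tika-parsers still compiles
     against the library, but Maven does not pass it on to consumers -->
<dependency>
  <groupId>org.apache.poi</groupId>
  <artifactId>poi</artifactId>
  <version>${poi.version}</version> <!-- assumed property -->
  <optional>true</optional>
</dependency>
```

With optional set to true, Maven leaves the library out of the transitive dependencies of downstream projects, so an application that needs the Office formats would declare the poi dependency itself.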

But I'd like to propose tika-parsers-optional as an alternative; its advantage is that it can leave all of the existing tika-parsers users in peace...

Thoughts?

Thanks, Sergey



On 19/06/14 20:22, Nick Burch wrote:
On Thu, 19 Jun 2014, Ray Gauss wrote:
The point of a tika-parsers-all artifact would be a single dependency
that re-aggregates everything so that downstream projects could work
the same way they do now and not worry about missing dependencies.

What’s the disadvantage for splitting things up (in a 2.0 timeframe)?

We already have users confused by the current split between tika-core
and tika-parsers - see users list for example. We already have users
confused by what dependencies they need with the current poms setup.
Splitting is going to make that a lot worse. (POI, as a related example,
sees plenty of confused users who've got mis-matched jars and problems.
Splitting is going to make that a lot worse.)

We have previously tried pushing parsers out of the tika parser jar and
into other jars, eg ones maintained by external groups, but on the whole
it hasn't been a great success. Keeping them in sync, dealing with
different cycles, applying updates, keeping them consistent, building in
a sensible length of time, all of that would be harder with a pile of
modules.

If we were to split out to the level needed by some of the use cases
mentioned, we'd have so many parser modules it'd be a nightmare to
maintain, and would cause the problems mentioned above. (People in other
threads have cautioned on these problems). If we split into just a
handful of sub modules, then many of the use cases mentioned still have
to do work to pick out the bits they need.

I still believe that the main use case of tika is "everything included",
and especially that's the beginners use case, so I think we should focus
on keeping that easy. Peeling out just some bits feels like an advanced
use case to me, so I'd rather we put the requirement for effort onto
those folks, rather than onto newbies and people on the typical uses.
I'd therefore much rather we provide advanced docs/help on excluding
some bits, rather than pull it out into a pile of different modules.

Nick

