Hi Nick, All,
I've revisited this subject recently, and I have to admit the current
situation is not ideal. I see new parsers are added every two weeks or so,
and having downstream tika-parsers consumers keep excluding all the parser
dependencies they do not need (a set which can change dynamically - well,
it's not that dynamic :-) but you see what I mean) can be a problem.
How about this approach:
Introduce a tika-parsers-optional module (pom.xml only) which will be
exactly the same as tika-parsers, except that it will depend on
tika-parsers with all the specific parser lib dependencies excluded, so
that they are effectively optional. This pom.xml will only have a single
dependency:
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <exclusions>
    <!-- exclude specific parser libs -->
  </exclusions>
</dependency>
The users who do not want to spend time excluding each and every parser
lib dependency they do not need will use tika-parsers-optional, look at
the Tika documentation, and add only those specific deps that they need.
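As a rough sketch of what that could look like on the consumer side: the
tika-parsers-optional artifact is only the proposal above and does not
exist, the version properties are placeholders, and PDFBox is picked purely
as an example of a parser lib an application might add back.

```xml
<!-- Sketch only: tika-parsers-optional is the proposed module, not an existing artifact -->
<dependencies>
  <dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers-optional</artifactId>
    <!-- placeholder property, to be defined by the consuming project -->
    <version>${tika.version}</version>
  </dependency>
  <!-- Add back only the parser lib this application needs, e.g. PDFBox for PDF files -->
  <dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <!-- placeholder property; it should match the version Tika was built against -->
    <version>${pdfbox.version}</version>
  </dependency>
</dependencies>
```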
To be honest, this seems to be a rather messy approach. Having
tika-parsers itself declare the parser lib dependencies as optional, and
getting users to add those libs they actually need (again after looking at
the documentation), is better. This is not that destabilizing to be honest -
any practical application is expected to be aware of the actual file
formats and parser libs supporting those formats.
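For illustration, a parser lib marked as optional inside the tika-parsers
pom.xml might look roughly like the fragment below. This is an assumption
about how it could be written, not the current pom; POI is used only as an
example and the version property is a placeholder. Maven does not pass
optional dependencies on transitively, so downstream projects would have to
declare the libs they need themselves.

```xml
<!-- Hypothetical fragment of the tika-parsers pom.xml, not its current content -->
<dependency>
  <groupId>org.apache.poi</groupId>
  <artifactId>poi</artifactId>
  <!-- placeholder property -->
  <version>${poi.version}</version>
  <!-- optional: available when building tika-parsers, not inherited by its consumers -->
  <optional>true</optional>
</dependency>
```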
But I'd like to propose tika-parsers-optional as an alternative; its
advantage is that it can leave all of the existing tika-parsers users in
peace...
Thoughts?
Thanks, Sergey
On 19/06/14 20:22, Nick Burch wrote:
On Thu, 19 Jun 2014, Ray Gauss wrote:
The point of a tika-parsers-all artifact would be a single dependency
that re-aggregates everything so that downstream projects could work
the same way they do now and not worry about missing dependencies.
What’s the disadvantage for splitting things up (in a 2.0 timeframe)?
We already have users confused by the current split between tika-core
and tika-parsers - see users list for example. We already have users
confused by what dependencies they need with the current poms setup.
Splitting is going to make that a lot worse. (POI, as a related example,
sees plenty of confused users who've got mismatched jars and problems.)
We have previously tried pushing parsers out of the tika parser jar and
into other jars, eg ones maintained by external groups, but on the whole
it hasn't been a great success. Keeping them in sync, dealing with
different cycles, applying updates, keeping them consistent, building in
a sensible length of time, all of that would be harder with a pile of
modules.
If we were to split out to the level needed by some of the use cases
mentioned, we'd have so many parser modules it'd be a nightmare to
maintain, and would cause the problems mentioned above. (People in other
threads have cautioned on these problems). If we split into just a
handful of sub modules, then many of the use cases mentioned still have
to do work to pick out the bits they need.
I still believe that the main use case of tika is "everything included",
and especially that's the beginner's use case, so I think we should focus
on keeping that easy. Peeling out just some bits feels like an advanced
use case to me, so I'd rather we put the requirement for effort onto
those folks, rather than onto newbies and people with typical use cases.
I'd therefore much rather we provide advanced docs/help on excluding
some bits, rather than pull it out into a pile of different modules.
Nick