Hi,
On 15/07/14 12:34, Ray Gauss wrote:
I’m not sure the third option is much more work up front than pulling apart the 
transitive dependencies for documentation purposes, though it is more sensitive 
as you say.

As far as I understand, the 3rd option would require introducing many micro modules. I guess the bigger tika-parsers becomes, the more important the 3rd option becomes. I'd just like to come up with some intermediate decision to get things moving a bit. The 3rd option can be reviewed for 2.0, as you suggested.

Just to confirm, with any of the other solutions we would need to manually 
document not just immediate dependencies but all transitive dependencies for 
each new parser added going forward rather than letting Maven automagically 
manage things, correct?

I'm thinking of documenting only top-level transitive dependencies. For example, if we want to work with PDFParser then we'd see that a pdf-box lib is used for it (documenting pdf-box's own dependencies is out of scope). If tika-parsers has some dependencies which are required by most Parser implementations then they'd stay there as is...
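
As a sketch of what such documentation would let a Maven user do: if the docs say PDFParser relies on PDFBox, a project that never parses PDFs could exclude that one top-level library. The `${tika.version}` property is a placeholder, and the PDFBox coordinates are given here as an assumption about what tika-parsers pulls in; running `mvn dependency:tree` shows the actual entries.

```xml
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <!-- placeholder: the Tika version your project targets -->
  <version>${tika.version}</version>
  <exclusions>
    <!-- assumed top-level lib behind PDFParser; anything reachable only
         through it is pruned from the dependency tree as well -->
    <exclusion>
      <groupId>org.apache.pdfbox</groupId>
      <artifactId>pdfbox</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```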

Cheers, Sergey

Regards,

Ray


On July 15, 2014 at 5:58:11 AM, Sergey Beryozkin wrote:
Hi All,
I've opened 2 JIRA issues, see [1] and [2].

[1] is about documenting the 3rd party transitive tika-parsers
dependencies to help Maven users exclude the libs not required in a
given project.

Help on resolving [1] from true Tika experts like Nick and others would
be appreciated :-).

I can volunteer to fix [2], but not only because that involves much
less work :-).

In [2] (which strongly depends on the resolution of [1]) I proposed
either making tika-parsers pom optionally depend on the 3rd party libs
(in which case I can promise Nick I will answer every user query related
to the new tika-parsers module not strongly depending on all of 3rd
party libs :-)) or keep tika-parsers intact and introduce a
tika-parsers-optional pom.
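
For illustration, the first variant might look roughly like this inside the tika-parsers pom. The `<optional>` element is standard Maven; the PDFBox entry and the `${pdfbox.version}` property are assumptions used as an example, not copied from the actual pom.

```xml
<!-- inside tika-parsers/pom.xml (sketch) -->
<dependency>
  <groupId>org.apache.pdfbox</groupId>
  <artifactId>pdfbox</artifactId>
  <!-- hypothetical property name -->
  <version>${pdfbox.version}</version>
  <!-- optional: available when building tika-parsers itself,
       but not passed on transitively to downstream projects -->
  <optional>true</optional>
</dependency>
```

With that in place, a downstream project that needs PDF support would have to declare the pdfbox dependency itself.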

There's also a 3rd solution mentioned earlier involving a complete
modularization of tika-parsers - that would be a more involved and
possibly more sensitive solution so I'm not adding it to the list in [2]
for now to make it easier for us to come to some resolution...

Thanks, Sergey

[1] https://issues.apache.org/jira/browse/TIKA-1367
[2] https://issues.apache.org/jira/browse/TIKA-1368

On 14/07/14 22:19, Sergey Beryozkin wrote:
Hi Nick, All,

I've revisited this subject recently. I have to admit it is not ideal.
I see new parsers are added every two weeks or so, and having downstream
tika-parsers consumers keep excluding all the unwanted dependencies
(which can change dynamically - well, it's not that dynamic :-) but you
see what I mean) can be a problem.

How about this approach:

Introduce tika-parsers-optional module (pom.xml only) which will be
exactly the same as tika-parsers except that tika-parsers-optional will
depend on tika-parsers but have all the specific parser libs
dependencies set as optional. Effectively this pom.xml will only have
a single dependency on tika-parsers.

The users who do not want to spend time on excluding each and every
parser lib dep they do not need will use tika-parsers-optional, look
at the Tika documentation and add only those specific deps that they need.
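
A rough sketch of how this could look, for illustration only. tika-parsers-optional is the proposed module and does not exist yet; the pom packaging, the version properties and the PDFBox coordinates are all assumptions, and a real pom would need one exclusion per parser lib.

```xml
<!-- tika-parsers-optional/pom.xml (hypothetical module): dependencies section -->
<dependencies>
  <dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>${tika.version}</version>
    <exclusions>
      <!-- one exclusion per 3rd party parser lib; PDFBox shown as an example -->
      <exclusion>
        <groupId>org.apache.pdfbox</groupId>
        <artifactId>pdfbox</artifactId>
      </exclusion>
    </exclusions>
  </dependency>
</dependencies>
```

A user who only needs PDF support would then depend on the new module and add back the one lib they need:

```xml
<!-- user's pom.xml: dependencies section (sketch) -->
<dependencies>
  <dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers-optional</artifactId>
    <version>${tika.version}</version>
    <!-- needed only if the module really is published with pom packaging -->
    <type>pom</type>
  </dependency>
  <dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>${pdfbox.version}</version>
  </dependency>
</dependencies>
```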

To be honest this seems to be a rather messy approach; having
tika-parsers use optional parser lib dependencies and getting users to
add those libs they actually need (again after looking at the
documentation) is better. This is not that destabilizing to be honest -
any practical application is expected to be aware of the actual file
formats and parser libs supporting those formats.

But I'd like to propose tika-parsers-optional as an alternative; its
advantage is that it can leave all existing tika-parsers users in peace...

Thoughts ?

Thanks, Sergey



On 19/06/14 20:22, Nick Burch wrote:
On Thu, 19 Jun 2014, Ray Gauss wrote:
The point of a tika-parsers-all artifact would be a single dependency
that re-aggregates everything so that downstream projects could work
the same way they do now and not worry about missing dependencies.

What’s the disadvantage of splitting things up (in a 2.0 timeframe)?

We already have users confused by the current split between tika-core
and tika-parsers - see users list for example. We already have users
confused by what dependencies they need with the current poms setup.
Splitting is going to make that a lot worse. (POI, as a related example,
sees plenty of confused users who've got mis-matched jars and problems.
Splitting is going to make that a lot worse.)

We have previously tried pushing parsers out of the tika parser jar and
into other jars, eg ones maintained by external groups, but on the whole
it hasn't been a great success. Keeping them in sync, dealing with
different cycles, applying updates, keeping them consistent, building in
a sensible length of time, all of that would be harder with a pile of
modules.

If we were to split out to the level needed by some of the use cases
mentioned, we'd have so many parser modules it'd be a nightmare to
maintain, and would cause the problems mentioned above. (People in other
threads have cautioned on these problems). If we split into just a
handful of sub modules, then many of the use cases mentioned still have
to do work to pick out the bits they need.

I still believe that the main use case of tika is "everything included",
and especially that's the beginner's use case, so I think we should focus
on keeping that easy. Peeling out just some bits feels like an advanced
use case to me, so I'd rather we put the requirement for effort onto
those folks, rather than onto newbies and people with typical use cases.
I'd therefore much rather we provide advanced docs/help on excluding
some bits, rather than pull it out into a pile of different modules.

Nick




