I’m not sure the third option is much more work up front than pulling apart the transitive dependencies for documentation purposes, though it is more sensitive as you say.
Just to confirm, with any of the other solutions we would need to manually document not just immediate dependencies but all transitive dependencies for each new parser added going forward, rather than letting Maven automagically manage things, correct?

Regards,

Ray

On July 15, 2014 at 5:58:11 AM, Sergey Beryozkin wrote:

> Hi All,
> I've opened 2 JIRA issues, see [1] and [2].
>
> [1] is about documenting the 3rd party transitive tika-parsers
> dependencies to help Maven users exclude the libs not required in a
> given project.
>
> Help on resolving [1] from true Tika experts like Nick and others would
> be appreciated :-).
>
> I can volunteer to fix [2], but not only because that involves much
> less work :-).
>
> In [2] (which strongly depends on the resolution of [1]) I proposed
> either making the tika-parsers pom optionally depend on the 3rd party libs
> (in which case I can promise Nick I will answer every user query related
> to the new tika-parsers module not strongly depending on all of the 3rd
> party libs :-)) or keeping tika-parsers intact and introducing a
> tika-parsers-optional pom.
>
> There's also a 3rd solution mentioned earlier involving a complete
> modularization of tika-parsers - that would be a more involved and
> possibly more sensitive solution, so I'm not adding it to the list in [2]
> for now to make it easier for us to come to some resolution...
>
> Thanks, Sergey
>
> [1] https://issues.apache.org/jira/browse/TIKA-1367
> [2] https://issues.apache.org/jira/browse/TIKA-1368
>
> On 14/07/14 22:19, Sergey Beryozkin wrote:
> > Hi Nick, All,
> >
> > I've revisited this subject recently. I have to admit it is not ideal.
> > I see new parsers are added every two weeks or so, and having downstream
> > tika-parsers consumers keep excluding all the required dependencies
> > (which can change dynamically - well, it's not that dynamic :-) but you
> > see what I mean) can present a problem.
> >
> > How about this approach:
> >
> > Introduce a tika-parsers-optional module (pom.xml only) which will be
> > exactly the same as tika-parsers except that tika-parsers-optional will
> > depend on tika-parsers but have all the specific parser libs
> > dependencies set as optional. Effectively this pom.xml will only have
> > a single dependency with
> >
> > tika-parsers
> >
> > The users who do not want to spend time on excluding each and every
> > parser lib dep they do not need will use tika-parsers-optional, look
> > at the Tika documentation and add only those specific deps that they need.
> >
> > To be honest this seems to be a rather messy approach; having
> > tika-parsers use optional parser lib dependencies and getting users
> > to add those libs they actually need (again after looking at the
> > documentation) is better. This is not that destabilizing to be honest -
> > any practical application is expected to be aware of the actual file
> > formats and parser libs supporting those formats.
> >
> > But I'd like to propose tika-parsers-optional as an alternative; its
> > advantage is that it can leave all existing tika-parsers users in peace...
> >
> > Thoughts?
> >
> > Thanks, Sergey
> >
> > On 19/06/14 20:22, Nick Burch wrote:
> >> On Thu, 19 Jun 2014, Ray Gauss wrote:
> >>> The point of a tika-parsers-all artifact would be a single dependency
> >>> that re-aggregates everything so that downstream projects could work
> >>> the same way they do now and not worry about missing dependencies.
> >>>
> >>> What’s the disadvantage of splitting things up (in a 2.0 timeframe)?
> >>
> >> We already have users confused by the current split between tika-core
> >> and tika-parsers - see users list for example. We already have users
> >> confused by what dependencies they need with the current poms setup.
> >> Splitting is going to make that a lot worse. (POI, as a related example,
> >> sees plenty of confused users who've got mis-matched jars and problems.
> >> Splitting is going to make that a lot worse.)
> >>
> >> We have previously tried pushing parsers out of the tika parser jar and
> >> into other jars, eg ones maintained by external groups, but on the whole
> >> it hasn't been a great success. Keeping them in sync, dealing with
> >> different cycles, applying updates, keeping them consistent, building in
> >> a sensible length of time, all of that would be harder with a pile of
> >> modules.
> >>
> >> If we were to split out to the level needed by some of the use cases
> >> mentioned, we'd have so many parser modules it'd be a nightmare to
> >> maintain, and would cause the problems mentioned above. (People in other
> >> threads have cautioned on these problems). If we split into just a
> >> handful of sub modules, then many of the use cases mentioned still have
> >> to do work to pick out the bits they need.
> >>
> >> I still believe that the main use case of tika is "everything included",
> >> and especially that's the beginner's use case, so I think we should focus
> >> on keeping that easy. Peeling out just some bits feels like an advanced
> >> use case to me, so I'd rather we put the requirement for effort onto
> >> those folks, rather than onto newbies and people on the typical uses.
> >> I'd therefore much rather we provide advanced docs/help on excluding
> >> some bits, rather than pull it out into a pile of different modules.
> >>
> >> Nick
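
To make the options in this thread a bit more concrete, here is a rough sketch of how each might look from a downstream project's pom.xml. Treat it as illustration only: the two libraries shown (pdfbox, poi-ooxml) are just examples of parser libs a project might not need, and the `tika.version` and `pdfbox.version` properties are placeholders I've made up.

With the current tika-parsers pom, the consumer pulls in everything and excludes what it doesn't want:

```xml
<dependencies>
  <dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <!-- placeholder property, define it in your own <properties> -->
    <version>${tika.version}</version>
    <exclusions>
      <!-- illustrative exclusions only; the full list is what [1] would document -->
      <exclusion>
        <groupId>org.apache.pdfbox</groupId>
        <artifactId>pdfbox</artifactId>
      </exclusion>
      <exclusion>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi-ooxml</artifactId>
      </exclusion>
    </exclusions>
  </dependency>
</dependencies>
```

With the tika-parsers-optional proposal (a hypothetical artifact, nothing like it is published), the consumer would instead start from nothing and opt in to the libs it needs:

```xml
<dependencies>
  <dependency>
    <groupId>org.apache.tika</groupId>
    <!-- hypothetical artifact from the proposal in [2] -->
    <artifactId>tika-parsers-optional</artifactId>
    <version>${tika.version}</version>
  </dependency>
  <!-- opt in to PDF parsing by declaring the parser lib explicitly -->
  <dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <!-- placeholder property; the consumer has to pick a compatible version -->
    <version>${pdfbox.version}</version>
  </dependency>
</dependencies>
```

Since Maven does not pass optional dependencies on to downstream projects, the second form puts the choice of each lib and its version on the consumer, which is where the documentation of transitive dependencies asked about above would be needed.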
