On Wed, 18 Jun 2014, Ken Krugler wrote:
On Jun 18, 2014, at 9:08am, Nick Burch <[email protected]> wrote:
On Wed, 18 Jun 2014, Sergey Beryozkin wrote:
Can we start with adding a section to the Tika docs documenting the core dependencies of the tika-parsers module, to make life a bit easier for developers who do not expect the specific parser implementations to be immediately downloaded?

Are you not just better off asking Maven nicely, and having it tell you that info itself? Much more likely to be accurate and up-to-date than something we cut and paste from Maven's output from time to time…

I'm curious - assuming I only want to parse HTML and PDF (as an example), then what's the right way to ask Maven nicely for what I need to include?

That's a different question though. Sergey wanted the docs to list the core dependencies of Tika Parsers, which Maven can tell you. (Direct dependencies are listed in the pom; direct + indirect come from "mvn dependency:list".)

If you just want one Tika parser, the simplest way is to:
 * Use tika-app --list-parser-details to find out which class handles
   the mimetype you want
 * Grep the tika parsers source tree for that class's package, and get
   the list of imports it makes
 * Change your pom which includes tika parsers to have an exclusion for *
   on the tika parsers dependency
 * Explicitly list the artifacts that provide the imports you saw
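
As a rough sketch of those last two steps, assuming you only want HTML and PDF, and assuming the imports you found point at TagSoup and PDFBox (do check the imports yourself, there may well be more), the pom could look something like the fragment below. The three version properties are placeholders I've made up for illustration, so set them to whatever "mvn dependency:list" reports for your Tika release. Note that a * exclusion also drops tika-core, so it has to be listed again explicitly, and that wildcard exclusions need Maven 3.2.1 or later.

```xml
<dependencies>
  <!-- tika-parsers with all of its transitive dependencies excluded -->
  <dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>${tika.version}</version> <!-- placeholder property -->
    <exclusions>
      <exclusion>
        <groupId>*</groupId>
        <artifactId>*</artifactId>
      </exclusion>
    </exclusions>
  </dependency>

  <!-- the wildcard exclusion removed tika-core too, so add it back -->
  <dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>${tika.version}</version>
  </dependency>

  <!-- assumed to cover the HTML parser's imports -->
  <dependency>
    <groupId>org.ccil.cowan.tagsoup</groupId>
    <artifactId>tagsoup</artifactId>
    <version>${tagsoup.version}</version> <!-- placeholder property -->
  </dependency>

  <!-- assumed to cover the PDF parser's imports -->
  <dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>${pdfbox.version}</version> <!-- placeholder property -->
  </dependency>
</dependencies>
```

Because pdfbox is declared directly here, Maven will pull in pdfbox's own dependencies as normal; the exclusion only applies to what tika-parsers would otherwise have brought in.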

Yes, it is largely manual, but at the point where you want to exclude a bunch of tika parsers, your use case is IMHO special enough, and you're already doing enough work, that the above isn't much extra.

(For most people, having everything there as standard is what you want to start with)

Nick
