On Wed, 18 Jun 2014, Ken Krugler wrote:
On Jun 18, 2014, at 9:08am, Nick Burch <[email protected]> wrote:
On Wed, 18 Jun 2014, Sergey Beryozkin wrote:
Can we start by adding a section to the Tika docs documenting the core
dependencies of the tika-parsers module, to make life a bit easier
for developers who do not expect the specific parser implementations
to be downloaded immediately?
Are you not just better off asking Maven nicely, and having it tell you
that info itself? Much more likely to be accurate and up-to-date than
something we cut and paste from Maven's output from time to time...
I'm curious - assuming I only want to parse HTML and PDF (as an
example), then what's the right way to ask Maven nicely for what I need
to include?
That's a different question though. Sergey wanted the docs to list the
core dependencies of Tika Parsers, which Maven can tell you. (Direct
dependencies are listed in the pom; direct + indirect come from "mvn
dependency:list".)
If you just want one Tika parser, the simplest way is to:
* Use tika-app --list-parser-details to find out which class handles
the mimetype you want
* Grep the tika parsers source tree for that class's package, and get
the list of imports it makes
* Change your pom, which includes tika parsers, to have an exclusion for *
on the tika parsers dependency
* Explicitly list the artifacts that provide the imports you saw
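
As a rough sketch, the last two steps might look like the pom fragment below for the HTML-and-PDF case. The wildcard exclusion assumes Maven 3.2.1 or later. The tagsoup and pdfbox coordinates and every version number here are illustrative assumptions; check them against the imports you actually found and against the versions the tika-parsers pom declares.

```xml
<dependencies>
  <!-- The wildcard exclusion below also drops tika-core, so list it explicitly. -->
  <dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>1.5</version> <!-- assumed version -->
  </dependency>

  <!-- Keep the parser classes themselves, but none of their dependencies. -->
  <dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.5</version> <!-- assumed version -->
    <exclusions>
      <exclusion>
        <groupId>*</groupId>
        <artifactId>*</artifactId>
      </exclusion>
    </exclusions>
  </dependency>

  <!-- Assumed to back the HTML parser (org.ccil.cowan.tagsoup imports). -->
  <dependency>
    <groupId>org.ccil.cowan.tagsoup</groupId>
    <artifactId>tagsoup</artifactId>
    <version>1.2.1</version> <!-- assumed version -->
  </dependency>

  <!-- Assumed to back the PDF parser (org.apache.pdfbox imports). -->
  <dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>1.8.4</version> <!-- assumed version -->
  </dependency>
</dependencies>
```

Artifacts listed explicitly this way still pull in their own transitive dependencies, so running "mvn dependency:list" afterwards is a reasonable check on what actually ends up on the classpath.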
Yes, it is largely manual, but at the point where you want to exclude a
bunch of tika parsers your use case is IMHO special enough that you're
doing enough work that the above isn't much extra.
(For most people, having everything there as standard is what you want to
start with)
Nick