Thanks Briggs and Sami for the pointers, very helpful. I also came across
ParseUtil and the ZipTextExtractor source was a perfect example for getting my
test cases up and running!
The separation of plugins with classloaders is a good idea. Several jar files in
my existing framework are at different versions from those in the Nutch 0.9
distribution and, for example, the PDF plugin would not work with the latest
PDFBox library.
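To illustrate why per-plugin class loaders help with this kind of version clash,
here is a minimal sketch of the general idea. It is not the Nutch code: the class
PluginLoaderSketch and its methods are hypothetical names, and it only scans one
directory for jar files, whereas Nutch builds the classpath from plugin.xml
metadata. Each plugin directory gets its own URLClassLoader, so two plugins can
carry different versions of the same library.

```java
import java.io.File;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Nutch: one class loader per plugin directory.
final class PluginLoaderSketch {

    // Collects the URLs of all *.jar files directly inside pluginDir.
    // Returns an empty list if the directory is missing or unreadable.
    static List<URL> jarUrls(File pluginDir) {
        List<URL> urls = new ArrayList<>();
        File[] files = pluginDir.listFiles((dir, name) -> name.endsWith(".jar"));
        if (files == null) {
            return urls;
        }
        Arrays.sort(files); // keep the classpath order deterministic
        for (File f : files) {
            try {
                urls.add(f.toURI().toURL());
            } catch (MalformedURLException e) {
                throw new IllegalStateException("Bad jar path: " + f, e);
            }
        }
        return urls;
    }

    // Builds a dedicated loader for one plugin. Classes found in the plugin's
    // jars are visible only through this loader; anything else is delegated
    // to the parent loader first (the standard URLClassLoader behaviour).
    static URLClassLoader loaderFor(File pluginDir, ClassLoader parent) {
        List<URL> urls = jarUrls(pluginDir);
        return new URLClassLoader(urls.toArray(new URL[0]), parent);
    }
}
```

With something like this, a call such as loaderFor(new File("plugins/parse-pdf"),
parent) (the path is only an example) would give a loader from which the plugin's
parser class can be loaded by name, without its jars leaking into the rest of the
application.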
It's been pretty painless to get up and running, and it seems to hang together
quite well so far. I built the mp3 parser and it plugged in easily enough.
The Tika project sounds interesting.
It would be good to remove the dependency on Hadoop Configuration for just the
parsing framework.
Antony
Sami Siren wrote:
Antony Bowesman wrote:
I'm looking to use the Nutch parsing framework in a separate Lucene
project. I'd like to be able to use the existing plugins directory
structure as-is, so wondered how Nutch sets up the class loading environment
to find all the jar files in the plugins directories.
There are dedicated class loaders for each plugin. The classpath is
constructed (recursively) based on plugin metadata (plugin.xml).
Any pointers to the Nutch class(es) that do the work?
Check the package o.a.n.plugin which contains most of the general
plug-in code.
There's also a recently established project called Apache Tika [1] which
has a goal of putting together generally usable parsing/extracting
framework. It hasn't yet got off the ground so there is a good chance
to get your voice heard.
[1] http://incubator.apache.org/tika/