Re: Classpath and plugins question

Antony Bowesman Thu, 19 Apr 2007 18:44:24 -0700

Thanks Briggs and Sami for the pointers, very helpful. I also came across ParseUtil and the ZipTextExtractor source was a perfect example for getting my test cases up and running!

The separation of plugins with classloaders is a good idea. My existing framework uses different versions of a number of jar files from those in the Nutch 0.9 distribution and, for example, the PDF plugin would not work with the latest PDFBox library.

It's been pretty pain free to get up and running, and it looks to hang together quite well so far. I built the mp3 parser and it plugged in easily enough.


The Tika project sounds interesting.

It would be good to remove the dependency on Hadoop Configuration for just the parsing framework.


Antony


Sami Siren wrote:

Antony Bowesman wrote:

I'm looking to use the Nutch parsing framework in a separate Lucene
project. I'd like to be able to use the existing plugins directory
structure as-is, so I wondered how Nutch sets up the class loading environment
to find all the jar files in the plugins directories.


There are dedicated class loaders for each plugin. The classpath is
constructed (recursively) based on plugin metadata (plugin.xml).
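
As a rough illustration of that idea (not the actual o.a.n.plugin code), the
sketch below builds one URLClassLoader per plugin from the jars in its
directory and then follows the plugin's declared dependencies recursively.
The class name PluginClassLoaderSketch, its methods, and the "requires" map
are hypothetical; the map stands in for whatever dependency information
plugin.xml carries, which this sketch does not parse.

```java
import java.io.File;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: one class loader per plugin, classpath built recursively. */
public class PluginClassLoaderSketch {

    /**
     * Returns the jar URLs for a plugin followed by those of every plugin it
     * (transitively) requires. The "requires" map is an assumption that
     * replaces real plugin.xml parsing: plugin id -> ids of required plugins.
     */
    static List<URL> classpathFor(String pluginId, File pluginsRoot,
                                  Map<String, List<String>> requires)
            throws MalformedURLException {
        List<URL> urls = new ArrayList<>();
        collect(pluginId, pluginsRoot, requires, new HashSet<>(), urls);
        return urls;
    }

    private static void collect(String pluginId, File pluginsRoot,
                                Map<String, List<String>> requires,
                                Set<String> seen, List<URL> urls)
            throws MalformedURLException {
        // Visit each plugin once so cyclic dependencies terminate.
        if (!seen.add(pluginId)) {
            return;
        }
        // Add every jar found directly in <pluginsRoot>/<pluginId>/.
        File[] jars = new File(pluginsRoot, pluginId)
                .listFiles((dir, name) -> name.endsWith(".jar"));
        if (jars != null) {
            Arrays.sort(jars); // deterministic order
            for (File jar : jars) {
                urls.add(jar.toURI().toURL());
            }
        }
        // Recurse into the plugins this one depends on.
        for (String dep : requires.getOrDefault(pluginId, Collections.emptyList())) {
            collect(dep, pluginsRoot, requires, seen, urls);
        }
    }

    /** Creates a dedicated class loader for one plugin. */
    static URLClassLoader loaderFor(String pluginId, File pluginsRoot,
                                    Map<String, List<String>> requires,
                                    ClassLoader parent)
            throws MalformedURLException {
        List<URL> urls = classpathFor(pluginId, pluginsRoot, requires);
        return new URLClassLoader(urls.toArray(new URL[0]), parent);
    }
}
```

Because each plugin gets its own loader, two plugins can each load a
different version of the same library without clashing, provided that
library is not also visible through the parent class loader.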

Any pointers to the Nutch class(es) that do the work?


Check the package o.a.n.plugin which contains most of the general
plug-in code.

There's also a recently established project called Apache Tika [1] which
has a goal of putting together generally usable parsing/extracting
framework. It hasn't yet got out of the ground so there is a good chance
to get your voice heard.

[1] http://incubator.apache.org/tika/
