Thanks Briggs and Sami for the pointers, very helpful. I also came across
ParseUtil, and the ZipTextExtractor source was a perfect example for getting my
test cases up and running!

The separation of plugins with classloaders is a good idea. My existing
framework uses a number of jar versions that differ from those in the Nutch
0.9 distribution and, for example, the PDF plugin would not work with the latest
PDFBox library.
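
To illustrate the isolation idea, here is a rough sketch of one class loader
per plugin directory. This is not the Nutch code: the class and method names
are made up for illustration, and it simply picks up every jar directly inside
a plugin directory rather than reading plugin.xml or resolving dependencies
between plugins.

```java
import java.io.File;
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Nutch: one class loader per plugin directory.
public class PluginLoaderSketch {

    /** Collects the jar files found directly inside one plugin directory. */
    public static List<URL> jarUrls(File pluginDir) throws IOException {
        List<URL> urls = new ArrayList<>();
        File[] files = pluginDir.listFiles();
        if (files != null) { // null if the path is not a readable directory
            for (File f : files) {
                if (f.isFile() && f.getName().endsWith(".jar")) {
                    urls.add(f.toURI().toURL());
                }
            }
        }
        return urls;
    }

    /**
     * Builds a dedicated class loader for one plugin directory. Classes not
     * found in the plugin's own jars are looked up through the parent loader.
     */
    public static URLClassLoader loaderFor(File pluginDir, ClassLoader parent)
            throws IOException {
        List<URL> urls = jarUrls(pluginDir);
        return new URLClassLoader(urls.toArray(new URL[0]), parent);
    }
}
```

With one loader per plugin, two plugins can each carry their own version of a
library, as long as that library is not also visible through the shared parent
loader (URLClassLoader asks the parent first).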

It's been pretty painless to get up and running, and it seems to hang
together quite well so far. I built the mp3 parser and it plugged in easily
enough.

The Tika project sounds interesting.

It would be good to remove the dependency on Hadoop Configuration for just the 
parsing framework.

Antony


Sami Siren wrote:
> Antony Bowesman wrote:
>> I'm looking to use the Nutch parsing framework in a separate Lucene
>> project. I'd like to be able to use the existing plugins directory
>> structure as-is, so I wondered how Nutch sets up the class loading
>> environment to find all the jar files in the plugins directories.
> 
> There are dedicated class loaders for each plugin. The classpath is
> constructed (recursively) based on plugin metadata (plugin.xml).
> 
>> Any pointers to the Nutch class(es) that do the work?
> 
> Check the package o.a.n.plugin which contains most of the general
> plug-in code.
> 
> There's also a recently established project called Apache Tika [1] which
> has a goal of putting together generally usable parsing/extracting
> framework. It hasn't yet got out of the ground so there is a good chance
> to get your voice heard.
> 
> [1] http://incubator.apache.org/tika/
> 

