Thanks Briggs and Sami for the pointers, very helpful. I also came across ParseUtil and the ZipTextExtractor source was a perfect example for getting my test cases up and running!
The separation of plugins with classloaders is a good idea. I have a number of different versions of jar files in my existing framework compared to the Nutch 0.9 distribution and, for example, the PDF plugin would not work with the latest PDFBox library.

It's been pretty pain-free to get up and running, and it looks to hang together quite well so far. I built the mp3 parser and it plugged in easily enough.

The Tika project sounds interesting. It would be good to remove the dependency on Hadoop Configuration for just the parsing framework.

Antony

Sami Siren wrote:
> Antony Bowesman wrote:
>> I'm looking to use the Nutch parsing framework in a separate Lucene
>> project. I'd like to be able to use the existing plugins directory
>> structure as-is, so wondered how Nutch sets up the class loading
>> environment to find all the jar files in the plugins directories.
>
> There are dedicated class loaders for each plugin. The classpath is
> constructed (recursively) based on plugin metadata (plugin.xml).
>
>> Any pointers to the Nutch class(es) that do the work?
>
> Check the package o.a.n.plugin which contains most of the general
> plug-in code.
>
> There's also a recently established project called Apache Tika [1] which
> has a goal of putting together generally usable parsing/extracting
> framework. It hasn't yet got out of the ground so there is a good chance
> to get your voice heard.
>
> [1] http://incubator.apache.org/tika/

_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
