Hi Folks,

My name is Chris Mattmann, and I work at the Jet Propulsion Laboratory.
I am having some *major* trouble trying to build an RSS content parser
plugin for Nutch. My plugin is based on the parse-pdf plugin structure
and uses the Apache commons-feedparser library. The feed I am testing
against is http://baron.pagemewhen.com:8080/~chris/hi.rss

I seed my crawler with the baron.pagemewhen.com:8080/~chris/ webpage,
and then tell it to fetch the content and start parsing via the
./bin/nutch crawl command. While it's crawling I get the attached output
in the nutch-crawl-log.txt file (along with print lines that I've
inserted into the Nutch source code myself so I can see what's
happening: these are denoted by the (&(&(& CHRIS variants).

I've gone round and round in the PluginRepository/PluginDescriptor
classes in the net.nutch.plugin package, and I now pretty much fully
understand how everything works and how the plugin class loader loads
the classes. You can even see in my log file that it got all the correct
classes in my classpath. The files are located in the right directory: I
have verified every class path URL to the jar files that it captured.
Further, I wrote a test2.java program that simulates dynamically loading
the RSS parser class using a URLClassLoader, and for whatever strange
reason, that same code works! Just not in Nutch. I've attached this
program as well for your convenience (the test2.java attachment). I've
also attached my RSS plugin, along with the dependent jar files in the
plugin structure and my plugin.xml file. The plugin is located here in a
zip file: http://baron.pagemewhen.com:8080/~chris/parse-rss.zip

Can someone please give me some idea as to what I'm doing wrong here? I
am so frustrated I'm pulling my hair out :-(

Further, while perusing the PluginManifestParser class looking for a
solution to my problem, I believe that I have found a bug. First off, I
wanted to let you know that I'm working with the 0.6 version of Nutch.
Inside the PluginManifestParser, I found the place where it loads the
libraries. When it looks for the "export" sub-element in the library
element within the plugin.xml file, there is a typo that causes it to
not function correctly on exported libraries. The fix is marked in the
following method:

    /**
     * @param rootElement
     * @param pluginDescriptor
     */
    private static void parseLibraries(Element pRootElement,
            PluginDescriptor pDescriptor) throws MalformedURLException {
        Element runtime = pRootElement.element("runtime");
        if (runtime == null)
            return;
        List libraries = runtime.elements("library");
        for (int i = 0; i < libraries.size(); i++) {
            Element library = (Element) libraries.get(i);
            String libName = library.attributeValue("name");
            // @Bug Fix
            // By: Chris Mattmann and Paul Ramirez
            Element exportElement = library.element("export"); // used to read "extport"
            if (exportElement != null)
                pDescriptor.addExportedLibRelative(libName);
            else
                pDescriptor.addNotExportedLibRelative(libName);
        }
    }

So, basically the XML child element name that it was looking for was
misspelled. Since I don't know how to commit to the Nutch source tree
(or if it's even allowed), I just wanted to pass this bug fix on to you
guys (I think it's a bug; correct me if I'm wrong). I'm also very
interested in becoming a committer and helping out on the project. I
think it's really cool.

So, if you guys could help me with my plugin problem, it would be much
appreciated. I'm doing this RSS plugin as part of my CS599: Seminar
on Search Engines Ph.D. course.

Thanks a lot.

Cheers,
  Chris

______________________________________________
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group
_________________________________________________
Jet Propulsion Laboratory
Office: 171-266B
Mailstop: 171-246
Phone: 818-354-8810
_______________________________________________________

Disclaimer: The opinions presented within are my own and do not
reflect those of either NASA, JPL, or the California Institute of
Technology.

> -----Original Message-----
> From: John X [mailto:[EMAIL PROTECTED]
> Sent: Friday, March 25, 2005 6:24 PM
> To: [email protected]; Jérôme Charron
> Cc: [EMAIL PROTECTED]
> Subject: Re: Mime/Magic mapper
>
> On Sat, Mar 26, 2005 at 01:48:05AM +0100, Jérôme Charron wrote:
> > Does somebody know why John Xing deactivate the mime.magic.file
> > support in protocol-file plugin?
>
> The "disabled" are only hooks to use mimetype/magic mapper.
> The mapper I used in a project had license issue (can't be redistributed).
> There is no mapper code in nutch. That's about one year ago.
> If you know one without license issue, I will be happy to add it in.
>
> John
>
> > I'm writing an mbox-parser plugin, and typically, an mbox has no
> > extension => it's mime type could not be determined using
> > extension/mime-type mapper.
> > For an mbox, the mime-type can only be defined by "analyzing" the file
> > content (using a mime-type/magic mapper).
> >
> > Thanks
> >
> > Jérôme
> >
> > --
> > http://motrech.free.fr/ - motrech [home]
> > http://motrech.blogspot.com/ - motrech [blog]
> > http://fr.groups.yahoo.com/group/motrech - motrech [liste]
> > http://fr.groups.yahoo.com/group/frutch - frutch [liste]
>
> __________________________________________
> http://www.neasys.com -
> Come to visit us today!

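For readers without the attachment, the kind of standalone check described in the message (loading a parser class through a URLClassLoader outside Nutch) might look roughly like the sketch below. This is not the attached test2.java. The class name `DynamicLoadSketch` and its helper methods are hypothetical, and the class to load and the jar paths are taken from the command line because the message does not name the parser class.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical stand-in for a standalone loading test; not the attached test2.java.
public class DynamicLoadSketch {

    /** Builds a URLClassLoader over the given jar paths; the parent is this class's own loader. */
    static URLClassLoader loaderFor(String... jarPaths) throws Exception {
        URL[] urls = new URL[jarPaths.length];
        for (int i = 0; i < jarPaths.length; i++) {
            urls[i] = Paths.get(jarPaths[i]).toUri().toURL();
        }
        return new URLClassLoader(urls, DynamicLoadSketch.class.getClassLoader());
    }

    /** Loads the named class through the loader and calls its public no-argument constructor. */
    static Object instantiate(ClassLoader loader, String className) throws Exception {
        Class<?> cls = Class.forName(className, true, loader);
        return cls.getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.err.println("usage: DynamicLoadSketch <class-name> [jar ...]");
            return;
        }
        // Everything after the class name is treated as a jar path.
        String[] jars = Arrays.copyOfRange(args, 1, args.length);
        try (URLClassLoader loader = loaderFor(jars)) {
            Object parser = instantiate(loader, args[0]);
            System.out.println("Loaded " + parser.getClass().getName()
                    + " via " + parser.getClass().getClassLoader());
        }
    }
}
```

In a check like this every jar sits in one flat loader, so all classes in those jars can see each other. That may differ from how a plugin framework arranges its loaders, which is one reason a standalone test can pass while the plugin fails.
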
[EMAIL PROTECTED] nutch]$ ./bin/nutch crawl urls -dir webbase -depth 3
050325 174644 parsing file:/home/chris/cs599-search-engines/nutch/conf/nutch-default.xml
050325 174645 parsing file:/home/chris/cs599-search-engines/nutch/conf/crawl-tool.xml
050325 174645 parsing file:/home/chris/cs599-search-engines/nutch/conf/nutch-site.xml
050325 174645 No FS indicated, using default:local
050325 174645 crawl started in: webbase
050325 174645 rootUrlFile = urls
050325 174645 threads = 10
050325 174645 depth = 3
050325 174646 Created webdb at LocalFS,/home/chris/cs599-search-engines/nutch/webbase/db
050325 174646 Starting URL processing
050325 174646 Plugins: looking in: /home/chris/cs599-search-engines/nutch/build/plugins
050325 174646 not including: /home/chris/cs599-search-engines/nutch/build/plugins/protocol-file
050325 174646 not including: /home/chris/cs599-search-engines/nutch/build/plugins/protocol-ftp
050325 174646 parsing: /home/chris/cs599-search-engines/nutch/build/plugins/protocol-http/plugin.xml
050325 174647 parsing: /home/chris/cs599-search-engines/nutch/build/plugins/parse-html/plugin.xml
050325 174647 parsing: /home/chris/cs599-search-engines/nutch/build/plugins/parse-text/plugin.xml
050325 174647 not including: /home/chris/cs599-search-engines/nutch/build/plugins/parse-pdf
050325 174647 parsing: /home/chris/cs599-search-engines/nutch/build/plugins/parse-rss/plugin.xml
050325 174647 not including: /home/chris/cs599-search-engines/nutch/build/plugins/parse-msword
050325 174647 not including: /home/chris/cs599-search-engines/nutch/build/plugins/parse-mp3
050325 174647 not including: /home/chris/cs599-search-engines/nutch/build/plugins/parse-rtf
050325 174647 not including: /home/chris/cs599-search-engines/nutch/build/plugins/parse-ext
050325 174647 parsing: /home/chris/cs599-search-engines/nutch/build/plugins/index-basic/plugin.xml
050325 174647 not including: /home/chris/cs599-search-engines/nutch/build/plugins/index-more
050325 174647 parsing: /home/chris/cs599-search-engines/nutch/build/plugins/query-basic/plugin.xml
050325 174647 not including: /home/chris/cs599-search-engines/nutch/build/plugins/query-more
050325 174647 parsing: /home/chris/cs599-search-engines/nutch/build/plugins/query-site/plugin.xml
050325 174647 parsing: /home/chris/cs599-search-engines/nutch/build/plugins/query-url/plugin.xml
050325 174647 parsing: /home/chris/cs599-search-engines/nutch/build/plugins/urlfilter-regex/plugin.xml
050325 174647 not including: /home/chris/cs599-search-engines/nutch/build/plugins/urlfilter-prefix
050325 174647 not including: /home/chris/cs599-search-engines/nutch/build/plugins/creativecommons
050325 174647 not including: /home/chris/cs599-search-engines/nutch/build/plugins/language-identifier
050325 174647 not including: /home/chris/cs599-search-engines/nutch/build/plugins/clustering-carrot2
050325 174647 not including: /home/chris/cs599-search-engines/nutch/build/plugins/ontology
050325 174647 (@#&(#(*@&(#&#&# CHRIS: PluginDescriptor:getClassLoader:Adding Exported lib file:/home/chris/cs599-search-engines/nutch/build/plugins/urlfilter-regex/urlfilter-regex.jar
050325 174647 (@#(#(#( CHRIS: At Critical Block
050325 174647 PLUG CLASS GETNAME: (#(#(#@(#(#( net.nutch.plugin.PluginDescriptor
050325 174647 (##(#(#( FINISHED CRITICAL BLOCK *(#(#(#(#(#(#((#
050325 174647 PluginDescriptor:getClassLoader:Final TEST: LSJ#(&(#$& CHRIS: Loading URL: file:/home/chris/cs599-search-engines/nutch/build/plugins/urlfilter-regex/urlfilter-regex.jar
050325 174647 (&(&*(&( CHRIS: Loading Plugin: Regex URL Filter
050325 174647 (&(&(&((& CHRIS: Has Dependencies:
050325 174648 found resource crawl-urlfilter.txt at file:/home/chris/cs599-search-engines/nutch/conf/crawl-urlfilter.txt
050325 174648 Using URL normalizer: net.nutch.net.BasicUrlNormalizer
050325 174648 Added 1 pages
050325 174648 Processing pagesByURL: Sorted 1 instructions in 0.033 seconds.
050325 174648 Processing pagesByURL: Sorted 30.3030303030303 instructions/second
050325 174648 Processing pagesByURL: Merged to new DB containing 1 records in 0.0090 seconds
050325 174648 Processing pagesByURL: Merged 111.11111111111111 records/second
050325 174648 Processing pagesByMD5: Sorted 1 instructions in 0.0060 seconds.
050325 174648 Processing pagesByMD5: Sorted 166.66666666666666 instructions/second
050325 174648 Processing pagesByMD5: Merged to new DB containing 1 records in 0.0010 seconds
050325 174648 Processing pagesByMD5: Merged 1000.0 records/second
050325 174648 Processing linksByMD5: Copied file (4096 bytes) in 0.012 secs.
050325 174648 Processing linksByURL: Copied file (4096 bytes) in 0.0060 secs.
050325 174648 FetchListTool started
050325 174649 Processing pagesByURL: Sorted 1 instructions in 0.0070 seconds.
050325 174649 Processing pagesByURL: Sorted 142.85714285714286 instructions/second
050325 174649 Processing pagesByURL: Merged to new DB containing 1 records in 0.0010 seconds
050325 174649 Processing pagesByURL: Merged 1000.0 records/second
050325 174649 Processing pagesByMD5: Sorted 1 instructions in 0.0070 seconds.
050325 174649 Processing pagesByMD5: Sorted 142.85714285714286 instructions/second
050325 174649 Processing pagesByMD5: Merged to new DB containing 1 records in 0.0010 seconds
050325 174649 Processing pagesByMD5: Merged 1000.0 records/second
050325 174649 Processing linksByMD5: Copied file (4096 bytes) in 0.0050 secs.
050325 174649 Processing linksByURL: Copied file (4096 bytes) in 0.0050 secs.
050325 174649 Processing /home/chris/cs599-search-engines/nutch/webbase/segments/20050325174648/fetchlist.unsorted: Sorted 1 entries in 0.0060 seconds.
050325 174649 Processing /home/chris/cs599-search-engines/nutch/webbase/segments/20050325174648/fetchlist.unsorted: Sorted 166.66666666666666 entries/second
050325 174649 Overall processing: Sorted 1 entries in 0.0060 seconds.
050325 174649 Overall processing: Sorted 0.0060 entries/second
050325 174649 FetchListTool completed
050325 174649 ParserFactory: staticPoint: Parser.X_POINT_ID=net.nutch.parse.Parser
050325 174649 logging at INFO
050325 174649 fetching http://baron.pagemewhen.com:8080/~chris/
050325 174649 (@#&(#(*@&(#&#&# CHRIS: PluginDescriptor:getClassLoader:Adding Exported lib file:/home/chris/cs599-search-engines/nutch/build/plugins/protocol-http/protocol-http.jar
050325 174649 (@#(#(#( CHRIS: At Critical Block
050325 174649 PLUG CLASS GETNAME: (#(#(#@(#(#( net.nutch.plugin.PluginDescriptor
050325 174649 (##(#(#( FINISHED CRITICAL BLOCK *(#(#(#(#(#(#((#
050325 174649 PluginDescriptor:getClassLoader:Final TEST: LSJ#((&( CHRIS: Loading Plugin: Html Parse Plug-in
050325 174650 (&(&(&((& CHRIS: Has Dependencies:
050325 174650 (&(&(CHRIS: Loading: file:/home/chris/cs599-search-engines/nutch/build/plugins/parse-html/nekohtml-0.9.4.jar
050325 174650 (&(*&(&& CHRIS: Content Type: text/html
050325 174650 (&(&(&(& CHRIS: Creating parser of type: net.nutch.parse.html.HtmlParser
050325 174651 status: segment 20050325174648, 1 pages, 0 errors, 161 bytes, 1644 ms
050325 174651 status: 0.6082725 pages/s, 0.76509273 kb/s, 161.0 bytes/page
050325 174652 Updating /home/chris/cs599-search-engines/nutch/webbase/db
050325 174652 Updating for /home/chris/cs599-search-engines/nutch/webbase/segments/20050325174648
050325 174652 Processing document 0
050325 174652 Finishing update
050325 174652 Processing pagesByURL: Sorted 2 instructions in 0.0070 seconds.
050325 174652 Processing pagesByURL: Sorted 285.7142857142857 instructions/second
050325 174652 Processing pagesByURL: Merged to new DB containing 2 records in 0.0010 seconds
050325 174652 Processing pagesByURL: Merged 2000.0 records/second
050325 174652 Processing pagesByMD5: Sorted 3 instructions in 0.0070 seconds.
050325 174652 Processing pagesByMD5: Sorted 428.57142857142856 instructions/second
050325 174653 Processing pagesByMD5: Merged to new DB containing 2 records in 0.012 seconds
050325 174653 Processing pagesByMD5: Merged 166.66666666666666 records/second
050325 174653 Processing linksByMD5: Sorted 2 instructions in 0.0060 seconds.
050325 174653 Processing linksByMD5: Sorted 333.3333333333333 instructions/second
050325 174653 Processing linksByMD5: Merged to new DB containing 1 records in 0.0070 seconds
050325 174653 Processing linksByMD5: Merged 142.85714285714286 records/second
050325 174653 Processing linksByURL: Sorted 1 instructions in 0.0070 seconds.
050325 174653 Processing linksByURL: Sorted 142.85714285714286 instructions/second
050325 174653 Processing linksByURL: Merged to new DB containing 1 records in 0.0070 seconds
050325 174653 Processing linksByURL: Merged 142.85714285714286 records/second
050325 174653 Processing linksByMD5: Sorted 1 instructions in 0.02 seconds.
050325 174653 Processing linksByMD5: Sorted 50.0 instructions/second
050325 174653 Processing linksByMD5: Merged to new DB containing 1 records in 0.0020 seconds
050325 174653 Processing linksByMD5: Merged 500.0 records/second
050325 174653 Update finished
050325 174653 FetchListTool started
050325 174653 Processing pagesByURL: Sorted 1 instructions in 0.0060 seconds.
050325 174653 Processing pagesByURL: Sorted 166.66666666666666 instructions/second
050325 174653 Processing pagesByURL: Merged to new DB containing 2 records in 0.0010 seconds
050325 174653 Processing pagesByURL: Merged 2000.0 records/second
050325 174653 Processing pagesByMD5: Sorted 1 instructions in 0.011 seconds.
050325 174653 e-rss.jar
050325 174654 (@#&(#(*@&(#&#&# CHRIS: PluginDescriptor:getClassLoader:Adding Not Exported lib file:/home/chris/cs599-search-engines/nutch/build/plugins/parse-rss/jdom.jar
050325 174654 (@#&(#(*@&(#&#&# CHRIS: PluginDescriptor:getClassLoader:Adding Not Exported lib file:/home/chris/cs599-search-engines/nutch/build/plugins/parse-rss/jaxen-full.jar
050325 174654 (@#&(#(*@&(#&#&# CHRIS: PluginDescriptor:getClassLoader:Adding Not Exported lib file:/home/chris/cs599-search-engines/nutch/build/plugins/parse-rss/commons-feedparser-0.5-beta.jar
050325 174654 (@#&(#(*@&(#&#&# CHRIS: PluginDescriptor:getClassLoader:Adding Not Exported lib file:/home/chris/cs599-search-engines/nutch/build/plugins/parse-rss/saxpath.jar
050325 174654 (@#&(#(*@&(#&#&# CHRIS: PluginDescriptor:getClassLoader:Adding Not Exported lib file:/home/chris/cs599-search-engines/nutch/build/plugins/parse-rss/commons-httpclient-3.0-beta1.jar
050325 174654 (@#&(#(*@&(#&#&# CHRIS: PluginDescriptor:getClassLoader:Adding Not Exported lib file:/home/chris/cs599-search-engines/nutch/build/plugins/parse-rss/log4j-1.2.6.jar
050325 174654 (@#(#(#( CHRIS: At Critical Block
050325 174654 PLUG CLASS GETNAME: (#(#(#@(#(#( net.nutch.plugin.PluginDescriptor
050325 174654 (##(#(#( FINISHED CRITICAL BLOCK *(#(#(#(#(#(#((#
050325 174654 PluginDescriptor:getClassLoader:Final TEST: LSJ#(&(#$& CHRIS: Loading URL: file:/home/chris/cs599-search-engines/nutch/build/plugins/parse-rss/parse-rss.jar
050325 174654 PluginDescriptor:getClassLoader:Final TEST: LSJ#(&(#$& CHRIS: Loading URL: file:/home/chris/cs599-search-engines/nutch/build/plugins/parse-rss/jdom.jar
050325 174654 PluginDescriptor:getClassLoader:Final TEST: LSJ#(&(#$& CHRIS: Loading URL: file:/home/chris/cs599-search-engines/nutch/build/plugins/parse-rss/jaxen-full.jar
050325 174654 PluginDescriptor:getClassLoader:Final TEST: LSJ#(&(#$& CHRIS: Loading URL: file:/home/chris/cs599-search-engines/nutch/build/plugins/parse-rss/commons-feedparser-0.5-beta.jar
050325 174654 PluginDescriptor:getClassLoader:Final TEST: LSJ#(&(#$& CHRIS: Loading URL: net.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:109)
Caused by: java.lang.NoClassDefFoundError: org/jdom/Document
        at org.jaxen.jdom.DocumentNavigator.getDocumentNode(DocumentNavigator.java:313)
        at org.jaxen.expr.DefaultAbsoluteLocationPath.evaluate(DefaultAbsoluteLocationPath.java:113)
        at org.jaxen.expr.DefaultXPathExpr.asList(DefaultXPathExpr.java:107)
        at org.jaxen.BaseXPath.selectNodesForContext(BaseXPath.java:716)
        at org.jaxen.BaseXPath.selectNodes(BaseXPath.java:239)
        at org.jaxen.BaseXPath.selectSingleNode(BaseXPath.java:262)
        at org.apache.commons.feedparser.RSSFeedParser.parse(RSSFeedParser.java:64)
        at org.apache.commons.feedparser.FeedParserImpl.parse(FeedParserImpl.java:187)
        ... 4 more
050325 174656 EXXXCEEEEPTION java.lang.NoClassDefFoundError: org/jdom/Document
050325 174656 fetch okay, but can't parse http://baron.pagemewhen.com:8080/~chris/hi.rss, reason: Can't be handled as rss document. org.apache.commons.feedparser.FeedParserException: java.lang.NoClassDefFoundError: org/jdom/Document
050325 174656 status: segment 20050325174653, 1 pages, 0 errors, 2741 bytes, 2073 ms
050325 174656 status: 0.48239267 pages/s, 10.329987 kb/s, 2741.0 bytes/page
050325 174657 Updating /home/chris/cs599-search-engines/nutch/webbase/db
050325 174657 Updating for /home/chris/cs599-search-engines/nutch/webbase/segments/20050325174653
050325 174657 Processing document 0
050325 174657 Finishing update
050325 174657 Processing pagesByURL: Sorted 1 instructions in 0.0060 seconds.
050325 174657 Processing pagesByURL: Sorted 166.66666666666666 instructions/second
050325 174657 Processing pagesByURL: Merged to new DB containing 2 records in 0.0010 seconds
050325 174657 Processing pagesByURL: Merged 2000.0 records/second
050325 174658 Processing pagesByMD5: Sorted 1 instructions in 0.0060 seconds.
050325 174658 Processing pagesByMD5: Sorted 166.66666666666666 instructions/second
050325 174658 Processing pagesByMD5: Merged to new DB containing 2 records in 0.0010 seconds
050325 174658 Processing pagesByMD5: Merged 2000.0 records/second
050325 174658 Processing linksByMD5: Copied file (4096 bytes) in 0.012 secs.
050325 174658 Processing linksByURL: Copied file (4096 bytes) in 0.0050 secs.
05032525 174700 logging at INFO
050325 174700 fetching http://baron.pagemewhen.com:8080/~chris/
050325 174700 getExtension: contentType=text/html suffix=null
050325 174700 (&(*&(&& CHRIS: Content Type: text/html
050325 174700 (&(&(&(& CHRIS: Creating parser of type: net.nutch.parse.html.HtmlParser
050325 174701 status: segment 20050325174659, 1 pages, 0 errors, 161 bytes, 1044 ms
050325 174701 status: 0.9578544 pages/s, 1.2048012 kb/s, 161.0 bytes/page
050325 174703 indexing segment: /home/chris/cs599-search-engines/nutch/webbase/segments/20050325174659
050325 174703 * Opening segment 20050325174659
050325 174703 * Indexing segment 20050325174659
050325 174703 (@#&(#(*@&(#&#&# CHRIS: PluginDescriptor:getClassLoader:Adding Exported lib file:/home/chris/cs599-search-engines/nutch/build/plugins/index-basic/index-basic.jar
050325 174703 (@#(#(#( CHRIS: At Critical Block
050325 174703 PLUG CLASS GETNAME: (#(#(#@(#(#( net.nutch.plugin.PluginDescriptor
050325 174703 (##(#(#( FINISHED CRITICAL BLOCK *(#(#(#(#(#(#((#
050325 174703 PluginDescriptor:getClassLoader:Final TEST: LSJ#(&(#$& CHRIS: Loading URL: file:/home/chris/cs599-search-engines/nutch/build/plugins/index-basic/index-basic.jar
050325 174703 (&(&*(&( CHRIS: Loading Plugin: Basic Indexing Filter
050325 174703 (&(&(&((& CHRIS: Has Dependencies:
050325 174703 (@#&(#(*@&(#&#&# CHRIS: PluginDescriptor:getClassLoader:Adding Exported lib file:/home/chris/cs599-search-engines/nutch/build/plugins/query-site/query-site.jar
050325 174703 (@#(#(#( CHRIS: At Critical Block
050325 174703 PLUG CLASS GETNAME: (#(#(#@(#(#( net.nutch.plugin.PluginDescriptor
050325 174703 (##(#(#( FINISHED CRITICAL B
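
The log above reports java.lang.NoClassDefFoundError: org/jdom/Document raised from inside org.jaxen.jdom.DocumentNavigator, even though jdom.jar appears among the URLs handed to the plugin's loader. One possible explanation, which is only an assumption and is not confirmed by anything in the log, is that the jaxen class was defined by a different class loader than the one that holds jdom.jar. A small helper along these lines could be called from the debugging print statements to see where a class came from and whether a given loader can see another class. The class `ClassOrigin` and its methods are hypothetical names for this sketch.

```java
import java.security.CodeSource;

// Hypothetical diagnostic helper; not part of Nutch or of the attached plugin.
public class ClassOrigin {

    /** Describes which jar or directory a class was read from and which loader defined it. */
    static String originOf(Class<?> c) {
        CodeSource src = c.getProtectionDomain().getCodeSource();
        String where = (src == null || src.getLocation() == null)
                ? "(no code source, e.g. a JDK class)"
                : src.getLocation().toString();
        ClassLoader cl = c.getClassLoader(); // null means the bootstrap loader
        return c.getName() + " <- " + where
                + " via " + (cl == null ? "bootstrap loader" : cl.toString());
    }

    /** Returns true if the loader (or one of its parents) can resolve the named class. */
    static boolean visibleFrom(ClassLoader loader, String className) {
        try {
            Class.forName(className, false, loader); // false: do not run static initializers
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(originOf(ClassOrigin.class));
        System.out.println(originOf(String.class));
    }
}
```

For example, printing `originOf(...)` for the jaxen navigator class, and `visibleFrom(thatClass.getClassLoader(), "org.jdom.Document")`, would show whether the loader that defined the jaxen class can reach JDOM at all.
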
