Hi Folks,

 

My name is Chris Mattmann, and I work at the Jet Propulsion Laboratory in Pasadena, CA, U.S.A. I'm new to the list. Nice to meet you all.

 

I am having some *major* trouble trying to build an RSS content parser plugin for Nutch. My plugin is based on the parse-pdf plugin structure and uses the Apache commons-feedparser library out of the Jakarta sandbox to parse RSS feeds and send them to Nutch for indexing. The problem that I am having is *very* strange. After about 2 days of going through the Nutch source code, I've tracked my problem down to the fact that, for whatever reason, the jdom.jar library that commons-feedparser relies on is not accessible via the Nutch plugin runtime. I keep getting the same error whenever I run the crawler to crawl RSS pages. I've set up a dummy web page with a single link to an RSS file. Here's the RSS file that it links to:

 

http://baron.pagemewhen.com:8080/~chris/hi.rss

 

 

I then seed my crawler with the baron.pagemewhen.com:8080/~chris/ web page and tell it to fetch the content and start parsing via the ./bin/nutch crawl command. While it's crawling I get the output in the attached nutch-crawl-log.txt file (along with print lines that I've inserted into the Nutch source code myself so I can see what's happening; these are denoted by the (&(&(& CHRIS variants).

I've gone round and round in the PluginRepository/PluginDescriptor classes in the net.nutch.plugin package, and I now pretty much fully understand how everything works and how the PluginClassLoader loads the classes. You can even see in my log file that it picked up all the correct jars for my classpath. The files are located in the right directory: I have verified all the classpath URLs to the jar files that it captured. Further, I wrote a test2.java program that simulates dynamically loading the RSS parser class using a URLClassLoader, and for whatever strange reason, that same code works! Just not in Nutch.

I've attached this program as well for your convenience (the test2.java attachment). I've also attached my RSS plugin, along with its dependent jar files in the plugin structure and my plugin.xml file. The plugin is located here in a zip file: http://baron.pagemewhen.com:8080/~chris/parse-rss.zip . Can someone please give me some idea as to what I'm doing wrong here? I am so frustrated I'm pulling my hair out :-(
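
For readers without the attachment, the kind of standalone check described above (resolving a class through a URLClassLoader built over the plugin's jars) might look roughly like the sketch below. This is not the attached test2.java; the class name ClassLoadCheck and its check method are hypothetical, and the jar paths and class name passed on the command line are whatever the reader supplies. It only reports whether a class resolves and which loader defined it, which can help when comparing a standalone run against the plugin runtime.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper, not part of Nutch and not the attached test2.java:
// checks whether a class resolves through a URLClassLoader over given jars.
class ClassLoadCheck {

    /** Returns a one-line report on whether className resolves via jarPaths. */
    static String check(String className, String... jarPaths) throws IOException {
        URL[] urls = new URL[jarPaths.length];
        for (int i = 0; i < jarPaths.length; i++) {
            urls[i] = Paths.get(jarPaths[i]).toUri().toURL();
        }
        // The parent is this class's own loader, so lookups go to the parent
        // first and fall back to the listed jars only if the parent fails.
        try (URLClassLoader loader =
                 new URLClassLoader(urls, ClassLoadCheck.class.getClassLoader())) {
            Class<?> c = Class.forName(className, false, loader);
            ClassLoader definedBy = c.getClassLoader();
            return className + " loaded by "
                + (definedBy == null ? "bootstrap loader" : definedBy.getClass().getName());
        } catch (ClassNotFoundException | NoClassDefFoundError e) {
            return className + " NOT FOUND: " + e;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java ClassLoadCheck <class-name> [jar ...]");
            return;
        }
        // args[0] is the class to resolve; the remaining args are jar paths.
        System.out.println(check(args[0], Arrays.copyOfRange(args, 1, args.length)));
    }
}
```

Running it with org.jdom.Document and the path to jdom.jar, and again with a class from jaxen-full.jar, shows which loader ends up defining each class. If a class turns out to be defined by a parent loader that cannot see jdom.jar, that would be one possible (unconfirmed) explanation for a NoClassDefFoundError like the one in the log.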

 

 

Further, while perusing the PluginManifestParser class looking for a solution to my problem, I believe that I have found a bug. First off, I'm working with the 0.6 version of Nutch. Inside PluginManifestParser, I found the place where it loads the libraries. When it looks for the "export" sub-element of the library element within the plugin.xml file, there is a typo that causes it to not function correctly on exported libraries. The typo is in the following method (shown here with the fix applied):

 

  /**
   * @param pRootElement
   * @param pDescriptor
   */
  private static void parseLibraries(Element pRootElement,
                                     PluginDescriptor pDescriptor) throws MalformedURLException {
    Element runtime = pRootElement.element("runtime");
    if (runtime == null)
      return;
    List libraries = runtime.elements("library");
    for (int i = 0; i < libraries.size(); i++) {
      Element library = (Element) libraries.get(i);
      String libName = library.attributeValue("name");

      //@Bug Fix
      //By: Chris Mattmann and Paul Ramirez

      Element exportElement = library.element("export"); //used to read "extport"
      if (exportElement != null)
        pDescriptor.addExportedLibRelative(libName);
      else
        pDescriptor.addNotExportedLibRelative(libName);
    }
  }

 

So, basically, the XML child element name that it was looking for was misspelled. Since I don't know how to commit to the Nutch source tree (or if it's even allowed), I just wanted to pass this bug fix along to you guys (I think it's a bug; correct me if I'm wrong). I'm also very interested in becoming a committer and helping out on the project. I think it's really cool.
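
To illustrate the effect of the misspelled element name, here is a minimal, self-contained sketch. It is not the Nutch code: it uses the JDK's built-in DOM parser instead of the Element API shown in the method above, the class LibraryExportSketch and its method are hypothetical, and the plugin.xml fragment is made up for illustration. It shows that looking up "extport" never matches a library that declares an "export" child, so every library would be treated as not exported.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration only; Nutch's PluginManifestParser works differently.
class LibraryExportSketch {

    /** Returns the names of <library> elements containing an element named childTag. */
    static List<String> librariesWithChild(String pluginXml, String childTag) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(pluginXml)));
        List<String> names = new ArrayList<>();
        NodeList libraries = doc.getElementsByTagName("library");
        for (int i = 0; i < libraries.getLength(); i++) {
            Element library = (Element) libraries.item(i);
            // getElementsByTagName searches all descendants, which is
            // sufficient for this flat, made-up fragment.
            if (library.getElementsByTagName(childTag).getLength() > 0) {
                names.add(library.getAttribute("name"));
            }
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        // Made-up manifest fragment: one exported library, one not exported.
        String xml = "<plugin><runtime>"
            + "<library name=\"parse-rss.jar\"><export name=\"*\"/></library>"
            + "<library name=\"jdom.jar\"/>"
            + "</runtime></plugin>";
        System.out.println("export  -> " + librariesWithChild(xml, "export"));   // [parse-rss.jar]
        System.out.println("extport -> " + librariesWithChild(xml, "extport"));  // []
    }
}
```

With the correct name, only parse-rss.jar is reported as exported; with the misspelled name, the result is empty.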

 

 

So, if you guys could help me with my plugin problem, it would be much appreciated. I'm doing this RSS plugin as part of my CS599: Seminar on Search Engines Ph.D. course at the University of Southern California.

 

Thanks a lot.

 

Cheers,

  Chris

 

 

 

 

 

 

______________________________________________

Chris A. Mattmann

[EMAIL PROTECTED]

Staff Member

Modeling and Data Management Systems Section (387)

Data Management Systems and Technologies Group

 

_________________________________________________

Jet Propulsion Laboratory            Pasadena, CA

Office: 171-266B                        Mailstop:  171-246

Phone:  818-354-8810

_______________________________________________________

 

Disclaimer:  The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.

 

 

> -----Original Message-----

> From: John X [mailto:[EMAIL PROTECTED]

> Sent: Friday, March 25, 2005 6:24 PM

> To: [email protected]; Jérôme Charron

> Cc: [EMAIL PROTECTED]

> Subject: Re: Mime/Magic mapper

>

> On Sat, Mar 26, 2005 at 01:48:05AM +0100, Jérôme Charron wrote:

> > Does somebody know why John Xing deactivate the mime.magic.file

> > support in protocol-file plugin?

>

> The "disabled" are only hooks to use mimetype/magic mapper.

> The mapper I used in a project had license issue (can't be redistributed).

> There is no mapper code in nutch. That's about one year ago.

> If you know one without license issue, I will be happy to add it in.

>

> John

>

> > I'm writing an mbox-parser plugin, and typically, an mbox has no

> > extension => it's mime type could not be determined using

> > extension/mime-type mapper.

> > For an mbox, the mime-type can only be defined by "analyzing" the file

> > content (using a mime-type/magic mapper).

> >

> > Thanks

> >

> > Jérôme

> >

> >

> >

> > --

> > http://motrech.free.fr/ - motrech [home]

> > http://motrech.blogspot.com/ - motrech [blog]

> > http://fr.groups.yahoo.com/group/motrech - motrech [liste]

> > http://fr.groups.yahoo.com/group/frutch - frutch [liste]

> >

> __________________________________________

> http://www.neasys.com - A Good Place to Be

> Come to visit us today!

[EMAIL PROTECTED] nutch]$ ./bin/nutch crawl urls -dir webbase -depth 3
050325 174644 parsing 
file:/home/chris/cs599-search-engines/nutch/conf/nutch-default.xml
050325 174645 parsing 
file:/home/chris/cs599-search-engines/nutch/conf/crawl-tool.xml
050325 174645 parsing 
file:/home/chris/cs599-search-engines/nutch/conf/nutch-site.xml
050325 174645 No FS indicated, using default:local
050325 174645 crawl started in: webbase
050325 174645 rootUrlFile = urls
050325 174645 threads = 10
050325 174645 depth = 3
050325 174646 Created webdb at 
LocalFS,/home/chris/cs599-search-engines/nutch/webbase/db
050325 174646 Starting URL processing
050325 174646 Plugins: looking in: 
/home/chris/cs599-search-engines/nutch/build/plugins
050325 174646 not including: 
/home/chris/cs599-search-engines/nutch/build/plugins/protocol-file
050325 174646 not including: 
/home/chris/cs599-search-engines/nutch/build/plugins/protocol-ftp
050325 174646 parsing: 
/home/chris/cs599-search-engines/nutch/build/plugins/protocol-http/plugin.xml
050325 174647 parsing: 
/home/chris/cs599-search-engines/nutch/build/plugins/parse-html/plugin.xml
050325 174647 parsing: 
/home/chris/cs599-search-engines/nutch/build/plugins/parse-text/plugin.xml
050325 174647 not including: 
/home/chris/cs599-search-engines/nutch/build/plugins/parse-pdf
050325 174647 parsing: 
/home/chris/cs599-search-engines/nutch/build/plugins/parse-rss/plugin.xml
050325 174647 not including: 
/home/chris/cs599-search-engines/nutch/build/plugins/parse-msword
050325 174647 not including: 
/home/chris/cs599-search-engines/nutch/build/plugins/parse-mp3
050325 174647 not including: 
/home/chris/cs599-search-engines/nutch/build/plugins/parse-rtf
050325 174647 not including: 
/home/chris/cs599-search-engines/nutch/build/plugins/parse-ext
050325 174647 parsing: 
/home/chris/cs599-search-engines/nutch/build/plugins/index-basic/plugin.xml
050325 174647 not including: 
/home/chris/cs599-search-engines/nutch/build/plugins/index-more
050325 174647 parsing: 
/home/chris/cs599-search-engines/nutch/build/plugins/query-basic/plugin.xml
050325 174647 not including: 
/home/chris/cs599-search-engines/nutch/build/plugins/query-more
050325 174647 parsing: 
/home/chris/cs599-search-engines/nutch/build/plugins/query-site/plugin.xml
050325 174647 parsing: 
/home/chris/cs599-search-engines/nutch/build/plugins/query-url/plugin.xml
050325 174647 parsing: 
/home/chris/cs599-search-engines/nutch/build/plugins/urlfilter-regex/plugin.xml
050325 174647 not including: 
/home/chris/cs599-search-engines/nutch/build/plugins/urlfilter-prefix
050325 174647 not including: 
/home/chris/cs599-search-engines/nutch/build/plugins/creativecommons
050325 174647 not including: 
/home/chris/cs599-search-engines/nutch/build/plugins/language-identifier
050325 174647 not including: 
/home/chris/cs599-search-engines/nutch/build/plugins/clustering-carrot2
050325 174647 not including: 
/home/chris/cs599-search-engines/nutch/build/plugins/ontology
050325 174647 (@#&(#(*@&(#&#&# CHRIS: PluginDescriptor:getClassLoader:Adding 
Exported lib 
file:/home/chris/cs599-search-engines/nutch/build/plugins/urlfilter-regex/urlfilter-regex.jar
050325 174647 (@#(#(#( CHRIS: At Critical  Block
050325 174647 PLUG CLASS GETNAME: (#(#(#@(#(#( net.nutch.plugin.PluginDescriptor
050325 174647 (##(#(#( FINISHED CRITICAL BLOCK *(#(#(#(#(#(#((#
050325 174647 PluginDescriptor:getClassLoader:Final TEST: LSJ#(&(#$& CHRIS: 
Loading URL: 
file:/home/chris/cs599-search-engines/nutch/build/plugins/urlfilter-regex/urlfilter-regex.jar
050325 174647 (&(&*(&( CHRIS: Loading Plugin: Regex URL Filter
050325 174647 (&(&(&((& CHRIS: Has Dependencies: 
050325 174648 found resource crawl-urlfilter.txt at 
file:/home/chris/cs599-search-engines/nutch/conf/crawl-urlfilter.txt
050325 174648 Using URL normalizer: net.nutch.net.BasicUrlNormalizer
050325 174648 Added 1 pages
050325 174648 Processing pagesByURL: Sorted 1 instructions in 0.033 seconds.
050325 174648 Processing pagesByURL: Sorted 30.3030303030303 instructions/second
050325 174648 Processing pagesByURL: Merged to new DB containing 1 records in 
0.0090 seconds
050325 174648 Processing pagesByURL: Merged 111.11111111111111 records/second
050325 174648 Processing pagesByMD5: Sorted 1 instructions in 0.0060 seconds.
050325 174648 Processing pagesByMD5: Sorted 166.66666666666666 
instructions/second
050325 174648 Processing pagesByMD5: Merged to new DB containing 1 records in 
0.0010 seconds
050325 174648 Processing pagesByMD5: Merged 1000.0 records/second
050325 174648 Processing linksByMD5: Copied file (4096 bytes) in 0.012 secs.
050325 174648 Processing linksByURL: Copied file (4096 bytes) in 0.0060 secs.
050325 174648 FetchListTool started
050325 174649 Processing pagesByURL: Sorted 1 instructions in 0.0070 seconds.
050325 174649 Processing pagesByURL: Sorted 142.85714285714286 
instructions/second
050325 174649 Processing pagesByURL: Merged to new DB containing 1 records in 
0.0010 seconds
050325 174649 Processing pagesByURL: Merged 1000.0 records/second
050325 174649 Processing pagesByMD5: Sorted 1 instructions in 0.0070 seconds.
050325 174649 Processing pagesByMD5: Sorted 142.85714285714286 
instructions/second
050325 174649 Processing pagesByMD5: Merged to new DB containing 1 records in 
0.0010 seconds
050325 174649 Processing pagesByMD5: Merged 1000.0 records/second
050325 174649 Processing linksByMD5: Copied file (4096 bytes) in 0.0050 secs.
050325 174649 Processing linksByURL: Copied file (4096 bytes) in 0.0050 secs.
050325 174649 Processing 
/home/chris/cs599-search-engines/nutch/webbase/segments/20050325174648/fetchlist.unsorted:
 Sorted 1 entries in 0.0060 seconds.
050325 174649 Processing 
/home/chris/cs599-search-engines/nutch/webbase/segments/20050325174648/fetchlist.unsorted:
 Sorted 166.66666666666666 entries/second
050325 174649 Overall processing: Sorted 1 entries in 0.0060 seconds.
050325 174649 Overall processing: Sorted 0.0060 entries/second
050325 174649 FetchListTool completed
050325 174649 ParserFactory: staticPoint: 
Parser.X_POINT_ID=net.nutch.parse.Parser
050325 174649 logging at INFO
050325 174649 fetching http://baron.pagemewhen.com:8080/~chris/
050325 174649 (@#&(#(*@&(#&#&# CHRIS: PluginDescriptor:getClassLoader:Adding 
Exported lib 
file:/home/chris/cs599-search-engines/nutch/build/plugins/protocol-http/protocol-http.jar
050325 174649 (@#(#(#( CHRIS: At Critical  Block
050325 174649 PLUG CLASS GETNAME: (#(#(#@(#(#( net.nutch.plugin.PluginDescriptor
050325 174649 (##(#(#( FINISHED CRITICAL BLOCK *(#(#(#(#(#(#((#
050325 174649 PluginDescriptor:getClassLoader:Final TEST: LSJ#((&( CHRIS: 
Loading Plugin: Html Parse Plug-in
050325 174650 (&(&(&((& CHRIS: Has Dependencies: 
050325 174650 (&(&(CHRIS: Loading: 
file:/home/chris/cs599-search-engines/nutch/build/plugins/parse-html/nekohtml-0.9.4.jar
050325 174650 (&(*&(&& CHRIS: Content Type: text/html
050325 174650 (&(&(&(& CHRIS: Creating parser of type: 
net.nutch.parse.html.HtmlParser
050325 174651 status: segment 20050325174648, 1 pages, 0 errors, 161 bytes, 
1644 ms
050325 174651 status: 0.6082725 pages/s, 0.76509273 kb/s, 161.0 bytes/page
050325 174652 Updating /home/chris/cs599-search-engines/nutch/webbase/db
050325 174652 Updating for 
/home/chris/cs599-search-engines/nutch/webbase/segments/20050325174648
050325 174652 Processing document 0
050325 174652 Finishing update
050325 174652 Processing pagesByURL: Sorted 2 instructions in 0.0070 seconds.
050325 174652 Processing pagesByURL: Sorted 285.7142857142857 
instructions/second
050325 174652 Processing pagesByURL: Merged to new DB containing 2 records in 
0.0010 seconds
050325 174652 Processing pagesByURL: Merged 2000.0 records/second
050325 174652 Processing pagesByMD5: Sorted 3 instructions in 0.0070 seconds.
050325 174652 Processing pagesByMD5: Sorted 428.57142857142856 
instructions/second
050325 174653 Processing pagesByMD5: Merged to new DB containing 2 records in 
0.012 seconds
050325 174653 Processing pagesByMD5: Merged 166.66666666666666 records/second
050325 174653 Processing linksByMD5: Sorted 2 instructions in 0.0060 seconds.
050325 174653 Processing linksByMD5: Sorted 333.3333333333333 
instructions/second
050325 174653 Processing linksByMD5: Merged to new DB containing 1 records in 
0.0070 seconds
050325 174653 Processing linksByMD5: Merged 142.85714285714286 records/second
050325 174653 Processing linksByURL: Sorted 1 instructions in 0.0070 seconds.
050325 174653 Processing linksByURL: Sorted 142.85714285714286 
instructions/second
050325 174653 Processing linksByURL: Merged to new DB containing 1 records in 
0.0070 seconds
050325 174653 Processing linksByURL: Merged 142.85714285714286 records/second
050325 174653 Processing linksByMD5: Sorted 1 instructions in 0.02 seconds.
050325 174653 Processing linksByMD5: Sorted 50.0 instructions/second
050325 174653 Processing linksByMD5: Merged to new DB containing 1 records in 
0.0020 seconds
050325 174653 Processing linksByMD5: Merged 500.0 records/second
050325 174653 Update finished
050325 174653 FetchListTool started
050325 174653 Processing pagesByURL: Sorted 1 instructions in 0.0060 seconds.
050325 174653 Processing pagesByURL: Sorted 166.66666666666666 
instructions/second
050325 174653 Processing pagesByURL: Merged to new DB containing 2 records in 
0.0010 seconds
050325 174653 Processing pagesByURL: Merged 2000.0 records/second
050325 174653 Processing pagesByMD5: Sorted 1 instructions in 0.011 seconds.
050325 174653 e-rss.jar
050325 174654 (@#&(#(*@&(#&#&# CHRIS: PluginDescriptor:getClassLoader:Adding 
Not Exported lib 
file:/home/chris/cs599-search-engines/nutch/build/plugins/parse-rss/jdom.jar
050325 174654 (@#&(#(*@&(#&#&# CHRIS: PluginDescriptor:getClassLoader:Adding 
Not Exported lib 
file:/home/chris/cs599-search-engines/nutch/build/plugins/parse-rss/jaxen-full.jar
050325 174654 (@#&(#(*@&(#&#&# CHRIS: PluginDescriptor:getClassLoader:Adding 
Not Exported lib 
file:/home/chris/cs599-search-engines/nutch/build/plugins/parse-rss/commons-feedparser-0.5-beta.jar
050325 174654 (@#&(#(*@&(#&#&# CHRIS: PluginDescriptor:getClassLoader:Adding 
Not Exported lib 
file:/home/chris/cs599-search-engines/nutch/build/plugins/parse-rss/saxpath.jar
050325 174654 (@#&(#(*@&(#&#&# CHRIS: PluginDescriptor:getClassLoader:Adding 
Not Exported lib 
file:/home/chris/cs599-search-engines/nutch/build/plugins/parse-rss/commons-httpclient-3.0-beta1.jar
050325 174654 (@#&(#(*@&(#&#&# CHRIS: PluginDescriptor:getClassLoader:Adding 
Not Exported lib 
file:/home/chris/cs599-search-engines/nutch/build/plugins/parse-rss/log4j-1.2.6.jar
050325 174654 (@#(#(#( CHRIS: At Critical  Block
050325 174654 PLUG CLASS GETNAME: (#(#(#@(#(#( net.nutch.plugin.PluginDescriptor
050325 174654 (##(#(#( FINISHED CRITICAL BLOCK *(#(#(#(#(#(#((#
050325 174654 PluginDescriptor:getClassLoader:Final TEST: LSJ#(&(#$& CHRIS: 
Loading URL: 
file:/home/chris/cs599-search-engines/nutch/build/plugins/parse-rss/parse-rss.jar
050325 174654 PluginDescriptor:getClassLoader:Final TEST: LSJ#(&(#$& CHRIS: 
Loading URL: 
file:/home/chris/cs599-search-engines/nutch/build/plugins/parse-rss/jdom.jar
050325 174654 PluginDescriptor:getClassLoader:Final TEST: LSJ#(&(#$& CHRIS: 
Loading URL: 
file:/home/chris/cs599-search-engines/nutch/build/plugins/parse-rss/jaxen-full.jar
050325 174654 PluginDescriptor:getClassLoader:Final TEST: LSJ#(&(#$& CHRIS: 
Loading URL: 
file:/home/chris/cs599-search-engines/nutch/build/plugins/parse-rss/commons-feedparser-0.5-beta.jar
050325 174654 PluginDescriptor:getClassLoader:Final TEST: LSJ#(&(#$& CHRIS: 
Loading URL: net.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:109)
Caused by: java.lang.NoClassDefFoundError: org/jdom/Document
        at 
org.jaxen.jdom.DocumentNavigator.getDocumentNode(DocumentNavigator.java:313)
        at 
org.jaxen.expr.DefaultAbsoluteLocationPath.evaluate(DefaultAbsoluteLocationPath.java:113)
        at org.jaxen.expr.DefaultXPathExpr.asList(DefaultXPathExpr.java:107)
        at org.jaxen.BaseXPath.selectNodesForContext(BaseXPath.java:716)
        at org.jaxen.BaseXPath.selectNodes(BaseXPath.java:239)
        at org.jaxen.BaseXPath.selectSingleNode(BaseXPath.java:262)
        at 
org.apache.commons.feedparser.RSSFeedParser.parse(RSSFeedParser.java:64)
        at 
org.apache.commons.feedparser.FeedParserImpl.parse(FeedParserImpl.java:187)
        ... 4 more
050325 174656 EXXXCEEEEPTION java.lang.NoClassDefFoundError: org/jdom/Document
050325 174656 fetch okay, but can't parse 
http://baron.pagemewhen.com:8080/~chris/hi.rss, reason: Can't be handled as rss 
document. org.apache.commons.feedparser.FeedParserException: 
java.lang.NoClassDefFoundError: org/jdom/Document
050325 174656 status: segment 20050325174653, 1 pages, 0 errors, 2741 bytes, 
2073 ms
050325 174656 status: 0.48239267 pages/s, 10.329987 kb/s, 2741.0 bytes/page
050325 174657 Updating /home/chris/cs599-search-engines/nutch/webbase/db
050325 174657 Updating for 
/home/chris/cs599-search-engines/nutch/webbase/segments/20050325174653
050325 174657 Processing document 0
050325 174657 Finishing update
050325 174657 Processing pagesByURL: Sorted 1 instructions in 0.0060 seconds.
050325 174657 Processing pagesByURL: Sorted 166.66666666666666 
instructions/second
050325 174657 Processing pagesByURL: Merged to new DB containing 2 records in 
0.0010 seconds
050325 174657 Processing pagesByURL: Merged 2000.0 records/second
050325 174658 Processing pagesByMD5: Sorted 1 instructions in 0.0060 seconds.
050325 174658 Processing pagesByMD5: Sorted 166.66666666666666 
instructions/second
050325 174658 Processing pagesByMD5: Merged to new DB containing 2 records in 
0.0010 seconds
050325 174658 Processing pagesByMD5: Merged 2000.0 records/second
050325 174658 Processing linksByMD5: Copied file (4096 bytes) in 0.012 secs.
050325 174658 Processing linksByURL: Copied file (4096 bytes) in 0.0050 secs.
05032525 174700 logging at INFO
050325 174700 fetching http://baron.pagemewhen.com:8080/~chris/
050325 174700 getExtension: contentType=text/html suffix=null
050325 174700 (&(*&(&& CHRIS: Content Type: text/html
050325 174700 (&(&(&(& CHRIS: Creating parser of type: 
net.nutch.parse.html.HtmlParser
050325 174701 status: segment 20050325174659, 1 pages, 0 errors, 161 bytes, 
1044 ms
050325 174701 status: 0.9578544 pages/s, 1.2048012 kb/s, 161.0 bytes/page
050325 174703 indexing segment: 
/home/chris/cs599-search-engines/nutch/webbase/segments/20050325174659
050325 174703 * Opening segment 20050325174659
050325 174703 * Indexing segment 20050325174659
050325 174703 (@#&(#(*@&(#&#&# CHRIS: PluginDescriptor:getClassLoader:Adding 
Exported lib 
file:/home/chris/cs599-search-engines/nutch/build/plugins/index-basic/index-basic.jar
050325 174703 (@#(#(#( CHRIS: At Critical  Block
050325 174703 PLUG CLASS GETNAME: (#(#(#@(#(#( net.nutch.plugin.PluginDescriptor
050325 174703 (##(#(#( FINISHED CRITICAL BLOCK *(#(#(#(#(#(#((#
050325 174703 PluginDescriptor:getClassLoader:Final TEST: LSJ#(&(#$& CHRIS: 
Loading URL: 
file:/home/chris/cs599-search-engines/nutch/build/plugins/index-basic/index-basic.jar
050325 174703 (&(&*(&( CHRIS: Loading Plugin: Basic Indexing Filter
050325 174703 (&(&(&((& CHRIS: Has Dependencies: 
050325 174703 (@#&(#(*@&(#&#&# CHRIS: PluginDescriptor:getClassLoader:Adding 
Exported lib 
file:/home/chris/cs599-search-engines/nutch/build/plugins/query-site/query-site.jar
050325 174703 (@#(#(#( CHRIS: At Critical  Block
050325 174703 PLUG CLASS GETNAME: (#(#(#@(#(#( net.nutch.plugin.PluginDescriptor
050325 174703 (##(#(#( FINISHED CRITICAL B
