RE: Huge Problem trying to develop plugin for Nutch

Chris Mattmann Fri, 25 Mar 2005 19:52:47 -0800

Hi,

For whatever reason (maybe file filtering) I think that my test2.java file that I attached didn’t go through. So, I renamed the extension to .txt. Let’s see if it goes through this time.

Sorry about having to send another email. Thanks very much for any help!

Cheers,

Chris

From: Chris Mattmann [mailto:[EMAIL PROTECTED]
Sent: Friday, March 25, 2005 6:09 PM
To: [email protected]
Cc: 'Ellis Horowitz'; [EMAIL PROTECTED]; [EMAIL PROTECTED]; 'Rami A. Al-Ghanmi'; [EMAIL PROTECTED]
Subject: Huge Problem trying to develop plugin for Nutch

Hi Folks,

My name is Chris Mattmann: I work at the Jet Propulsion Laboratory in Pasadena, CA, U.S.A. I'm new to the list. Nice to meet you all.

I am having some * major * trouble trying to build an RSS content parser plugin for Nutch. My plugin is based on the parse-pdf plugin structure and uses the Apache commons-feedparser library out of the Jakarta sandbox to parse RSS feeds and send them to Nutch for indexing. The problem that I am having is * very * strange. After about 2 days of going through the Nutch source code, I've tracked it down to the fact that, for whatever reason, the jdom.jar library that commons-feedparser relies on is not accessible via the Nutch plugin runtime. I keep getting the same error whenever I run the crawler to crawl RSS pages. I've set up a dummy web page with a single link to an RSS file. Here's the webpage:

http://baron.pagemewhen.com:8080/~chris/hi.rss

So, basically I seed my crawler with the baron.pagemewhen.com:8080/~chris/ webpage, and then tell it to go get the content and start parsing via the ./bin/nutch crawl command. While it's crawling I get the attached output in the nutch-crawl-log.txt file (along with print lines that I've inserted into the Nutch source code myself so I can see what's happening: these are denoted by the (&(&(& CHRIS variants).

I've gone round and round in the PluginRepository/PluginDescriptor classes in the net.nutch.plugin package, and I now pretty much fully understand how everything is working and how the plugin class loader is loading the classes. You can even see in my log file that it got all the correct classes in my classpath. The files are located in the right directory: I have verified all the classpath URLs to the jar files that it captured. Further, I wrote a test2.java program that simulates dynamically loading the RSS parser class using a URLClassLoader, and for whatever strange reason, that same code works! Just not in Nutch.

I've attached this program as well for your convenience (the test2.java attachment). I've also attached my RSS plugin, along with the dependent jar files in the plugin structure and my plugin.xml file. The plugin is located here in a zip file: http://baron.pagemewhen.com:8080/~chris/parse-rss.zip . Can someone please give me some idea as to what I'm doing wrong here? I am so frustrated I'm pulling my hair out :-(
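One way to narrow down a problem like this is to ask a specific class loader whether it can resolve a given class. The following is only a minimal sketch and not Nutch code: the class name `LoaderCheck`, the helper `canSee`, and the command-line arguments are all assumptions made for illustration.

```java
import java.io.File;
import java.net.URL;
import java.net.URLClassLoader;

public class LoaderCheck {

    // Returns true if the given loader (or one of its parents) can resolve
    // the named class. The class is loaded but not initialized.
    static boolean canSee(String className, ClassLoader loader) {
        try {
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical usage: java LoaderCheck <jar-or-dir> <class-name>
        if (args.length < 2) {
            System.err.println("usage: java LoaderCheck <jar-or-dir> <class-name>");
            return;
        }
        URL[] urls = { new File(args[0]).toURI().toURL() };
        ClassLoader parent = LoaderCheck.class.getClassLoader();
        // The child loader delegates to the parent first, then searches urls.
        try (URLClassLoader child = new URLClassLoader(urls, parent)) {
            System.out.println("parent sees " + args[1] + ": " + canSee(args[1], parent));
            System.out.println("child sees " + args[1] + ": " + canSee(args[1], child));
        }
    }
}
```

Run against jdom.jar and `org.jdom.Document`, a result where the child sees the class but the parent does not would suggest the jar itself is fine, and that the open question is which loader defines the class that refers to it.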

Further, while perusing the PluginManifestParser class looking for a solution to my problem, I believe that I have found a bug. First off, I wanted to let you know that I'm working with the 0.6 version of Nutch. Inside the PluginManifestParser, I found the place where it loads the libraries. When it looks for the "export" sub-element in the library element within the plugin.xml file, there is a typo that causes it to not function correctly on exported libraries. The fix is the following:

/**
 * @param rootElement
 * @param pluginDescriptor
 */
private static void parseLibraries(Element pRootElement,
    PluginDescriptor pDescriptor) throws MalformedURLException {
  Element runtime = pRootElement.element("runtime");
  if (runtime == null)
    return;
  List libraries = runtime.elements("library");
  for (int i = 0; i < libraries.size(); i++) {
    Element library = (Element) libraries.get(i);
    String libName = library.attributeValue("name");
    //@Bug Fix
    //By: Chris Mattmann and Paul Ramirez
    Element exportElement = library.element("export"); //used to read "extport"
    if (exportElement != null)
      pDescriptor.addExportedLibRelative(libName);
    else
      pDescriptor.addNotExportedLibRelative(libName);
  }
}

So, basically the XML child element name that it was looking for was misspelled. Since I don’t know how to commit to the Nutch source tree (or if it’s even allowed), I just wanted to pass this bug fix (I think it’s a bug; correct me if I’m wrong) on to you guys. I’m also very interested in becoming a committer/helping out on the project. I think it’s really cool.
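To illustrate why the misspelling matters, here is a small standalone sketch. It uses the JDK's built-in DOM parser rather than the XML library Nutch uses, and the plugin.xml fragment, the class name `ExportLookupDemo`, and the helpers `hasChild` and `firstLibrary` are made up for the example. The point is only that a lookup for a misspelled child element quietly finds nothing, so the library would fall into the "not exported" branch.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

public class ExportLookupDemo {

    // Hypothetical plugin.xml fragment, invented for this example.
    static final String XML =
        "<plugin><runtime>"
      + "<library name=\"jdom.jar\"><export name=\"*\"/></library>"
      + "</runtime></plugin>";

    // True if parent has a direct child element with the given tag name.
    static boolean hasChild(Element parent, String name) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE && name.equals(n.getNodeName())) {
                return true;
            }
        }
        return false;
    }

    // Parses the fragment above and returns its first <library> element.
    static Element firstLibrary() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(XML.getBytes(StandardCharsets.UTF_8)));
        return (Element) doc.getElementsByTagName("library").item(0);
    }

    public static void main(String[] args) throws Exception {
        Element library = firstLibrary();
        // The misspelled name matches nothing, so this prints false.
        System.out.println("lookup \"extport\": " + hasChild(library, "extport"));
        // The correct name matches the <export> child, so this prints true.
        System.out.println("lookup \"export\": " + hasChild(library, "export"));
    }
}
```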

So, yeah, if you guys could help me with my plugin problem, it would be much appreciated. I’m doing this RSS plugin as part of my CS599: Seminar on Search Engines Ph.D. course at the University of Southern California.

Thanks a lot.

Cheers,

Chris

______________________________________________

Chris A. Mattmann

[EMAIL PROTECTED]

Staff Member

Modeling and Data Management Systems Section (387)

Data Management Systems and Technologies Group

_________________________________________________

Jet Propulsion Laboratory Pasadena, CA

Office: 171-266B Mailstop: 171-246

Phone: 818-354-8810

_______________________________________________________

Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.

> -----Original Message-----

> From: John X [mailto:[EMAIL PROTECTED]

> Sent: Friday, March 25, 2005 6:24 PM

> To: [email protected]; Jérôme Charron

> Cc: [EMAIL PROTECTED]

> Subject: Re: Mime/Magic mapper

> On Sat, Mar 26, 2005 at 01:48:05AM +0100, Jérôme Charron wrote:

> > Does somebody know why John Xing deactivate the mime.magic.file

> > support in protocol-file plugin?

> The "disabled" are only hooks to use mimetype/magic mapper.

> The mapper I used in a project had license issue (can't be redistributed).

> There is no mapper code in nutch. That's about one year ago.

> If you know one without license issue, I will be happy to add it in.

> John

> > I'm writing an mbox-parser plugin, and typically, an mbox has no

> > extension => it's mime type could not be determined using

> > extension/mime-type mapper.

> > For an mbox, the mime-type can only be defined by "analyzing" the file

> > content (using a mime-type/magic mapper).

> >

> > Thanks

> >

> > Jérôme

> >

> > --

> > http://motrech.free.fr/ - motrech [home]

> > http://motrech.blogspot.com/ - motrech [blog]

> > http://fr.groups.yahoo.com/group/motrech - motrech [liste]

> > http://fr.groups.yahoo.com/group/frutch - frutch [liste]

> >

> __________________________________________

> http://www.neasys.com - A Good Place to Be

> Come to visit us today!

import java.net.URL;
import java.net.URLClassLoader;


public class test2{

    public test2(){}

    public static void main (String [] args) throws Exception{
        test2 t = new test2();
        URL [] theURLs = new URL[7];

        theURLs[5] = new URL("file:/home/chris/cs599-search-engines/nutch/build/plugins/parse-rss/saxpath.jar");
        theURLs[6] = new URL("file:/home/chris/cs599-search-engines/nutch/build/plugins/parse-rss/jaxen-full.jar");
        //theURLs[5] = new URL("file:/home/chris/cs599-search-engines/nutch/build/plugins/parse-rss/jdom.jar");
        theURLs[0] = new URL("file:/home/chris/cs599-search-engines/nutch/build/plugins/parse-rss/parse-rss.jar");
        theURLs[1] = new URL("file:/home/chris/cs599-search-engines/nutch/build/classes/");
        theURLs[2] = new URL("file:/home/chris/cs599-search-engines/nutch/build/plugins/parse-rss/commons-feedparser-0.5-beta.jar");
        theURLs[3] = new URL("file:/home/chris/cs599-search-engines/nutch/build/plugins/parse-rss/log4j-1.2.6.jar");
        theURLs[4] = new URL("file:/home/chris/cs599-search-engines/nutch/build/plugins/parse-rss/commons-httpclient-3.0-beta1.jar");
        System.out.println(t.getClass().getName());
        URLClassLoader theLoader = new URLClassLoader(theURLs, t.getClass().getClassLoader());
        //      theLoader.loadClass("org.jdom.Document");

        Class c = theLoader.loadClass("net.nutch.parse.rss.RSSParser");
        Object o = (c.getConstructors()[0]).newInstance(null);
        c.getMethod("testMain",null).invoke(o,null); //this works fine!
        
        //org.jdom.Document d = new org.jdom.Document();
    }

}
