Hi,

For whatever reason (maybe file filtering), I think the test2.java file that I
attached didn’t go through. So I renamed the extension to .txt. Let’s see if it
goes through this time. Sorry about having to send another email.

Thanks very much for any help!

Cheers,
Chris

From: Chris Mattmann [mailto:[EMAIL PROTECTED]

Hi Folks,

My name is Chris Mattmann. I work at the Jet Propulsion Laboratory in

I am having some *major* trouble trying to build an RSS content parser plugin
for Nutch. My plugin is based on the parse-pdf plugin structure and uses the
Apache commons-feedparser library out of the

http://baron.pagemewhen.com:8080/~chris/hi.rss

So, basically, I seed my crawler with the baron.pagemewhen.com:8080/~chris/
webpage, and then tell it to go get the content and start parsing via the
./bin/nutch crawl command. While it's crawling, I get the output in the attached
nutch-crawl-log.txt file (along with print lines that I've inserted into the
Nutch source code myself so I can see what's happening; these are denoted by
the (&(&(& CHRIS variants).

I've gone round and round in the PluginRepository/PluginDescriptor classes in
the net.nutch.plugin package, and I now pretty much fully understand how
everything is working and how the PluginClassLoader is loading the classes. You
can even see in my log file that it got all the correct classes in my classpath.
The files are located in the right directory: I have verified all the classpath
URLs to the jar files that it captured. Further, I wrote a test2.java program
that simulates dynamically loading the RSS parser class using a URLClassLoader,
and for whatever strange reason, that same code works! Just not in Nutch. I've
attached this program as well for your convenience (the test2.java attachment).
I've also attached my RSS plugin, along with the dependent jar files in the
plugin structure and my plugin.xml file. The plugin is located here in a zip
file:

http://baron.pagemewhen.com:8080/~chris/parse-rss.zip

Can someone please give me some idea as to what I'm doing wrong here???? I am
so frustrated I'm pulling my hair out :-(

Further, while perusing the PluginManifestParser class looking for a solution
to my problem, I believe that I have found a bug. First off, I wanted to let you
know that I'm working with the 0.6 version of Nutch. Inside the
PluginManifestParser, I found the place where it loads the libraries. When it
looks for the "export" sub-element in the library element within the plugin.xml
file, there is actually a typo that is causing it to not
function correctly on exported libraries. The typo is the following:

    /**
     * @param rootElement
     * @param pluginDescriptor
     */
    private static void parseLibraries(Element pRootElement,
            PluginDescriptor pDescriptor) throws MalformedURLException {
        Element runtime = pRootElement.element("runtime");
        if (runtime == null)
            return;
        List libraries = runtime.elements("library");
        for (int i = 0; i < libraries.size(); i++) {
            Element library = (Element) libraries.get(i);
            String libName = library.attributeValue("name");
            //@Bug Fix
            //By: Chris Mattmann and Paul Ramirez
            Element exportElement = library.element("export"); //used to read "extport"
            if (exportElement != null)
                pDescriptor.addExportedLibRelative(libName);
            else
                pDescriptor.addNotExportedLibRelative(libName);
        }
    }

So, basically the xml child element name that it was looking for was
misspelled. Since I don’t know how to commit to the Nutch source tree (or
if it’s even allowed), I just wanted to pass this bug fix (I think it’s a bug;
correct me if I’m wrong) on to you guys. I’m also very interested in becoming a
committer/helping out on the project. I think it’s really cool.

So, yeah, if you guys could help me with my plugin problem, it would be much
appreciated. I’m doing this RSS plugin as part of my CS599: Seminar
on Search Engines Ph.D. course at the

Thanks a lot.

Cheers,
Chris

______________________________________________
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group
_________________________________________________
Jet Propulsion Laboratory
Office: 171-266B
Mailstop: 171-246
Phone: 818-354-8810
_______________________________________________________

Disclaimer: The opinions presented within are my own and do not
reflect those of either NASA, JPL, or the California Institute of Technology.

> -----Original Message-----
> From: John X [mailto:[EMAIL PROTECTED]
> Sent: Friday,
> To: [email protected]; Jérôme Charron
> Cc: [EMAIL PROTECTED]
> Subject: Re: Mime/Magic mapper
>
> On Sat,
> > Does somebody know why John Xing deactivate the mime.magic.file
> > support in protocol-file plugin?
>
> The "disabled" are only hooks to use mimetype/magic mapper.
> The mapper I used in a project had license issue (can't be redistributed).
> There is no mapper code in nutch. That's about one year ago.
> If you know one without license issue, I will be happy to add it in.
>
> John
>
> > I'm writing an mbox-parser plugin, and typically, an mbox has no
> > extension => it's mime type could not be determined using
> > extension/mime-type mapper.
> > For an mbox, the mime-type can only be defined by "analyzing" the file
> > content (using a mime-type/magic mapper).
> >
> > Thanks
> >
> > Jérôme
> >
> > --
> > http://motrech.free.fr/ - motrech [home]
> > http://motrech.blogspot.com/ - motrech [blog]
> > http://fr.groups.yahoo.com/group/motrech - motrech [liste]
> > http://fr.groups.yahoo.com/group/frutch - frutch [liste]
>
> __________________________________________
> http://www.neasys.com -
> Come to visit us today!

import java.net.URL;
import java.net.URLClassLoader;

public class test2 {

    public test2() {}

    public static void main(String[] args) throws Exception {
        test2 t = new test2();

        // Classpath entries: the plugin jar, the Nutch build classes, and the dependent jars.
        URL[] theURLs = new URL[7];
        theURLs[0] = new URL("file:/home/chris/cs599-search-engines/nutch/build/plugins/parse-rss/parse-rss.jar");
        theURLs[1] = new URL("file:/home/chris/cs599-search-engines/nutch/build/classes/");
        theURLs[2] = new URL("file:/home/chris/cs599-search-engines/nutch/build/plugins/parse-rss/commons-feedparser-0.5-beta.jar");
        theURLs[3] = new URL("file:/home/chris/cs599-search-engines/nutch/build/plugins/parse-rss/log4j-1.2.6.jar");
        theURLs[4] = new URL("file:/home/chris/cs599-search-engines/nutch/build/plugins/parse-rss/commons-httpclient-3.0-beta1.jar");
        theURLs[5] = new URL("file:/home/chris/cs599-search-engines/nutch/build/plugins/parse-rss/saxpath.jar");
        theURLs[6] = new URL("file:/home/chris/cs599-search-engines/nutch/build/plugins/parse-rss/jaxen-full.jar");
        //theURLs[5] = new URL("file:/home/chris/cs599-search-engines/nutch/build/plugins/parse-rss/jdom.jar");

        System.out.println(t.getClass().getName());

        // Child loader whose parent is the loader that loaded test2.
        URLClassLoader theLoader = new URLClassLoader(theURLs, t.getClass().getClassLoader());
        // theLoader.loadClass("org.jdom.Document");

        // Load the parser reflectively, instantiate it with its first constructor
        // (assumed to take no arguments), and call its testMain() method.
        Class<?> c = theLoader.loadClass("net.nutch.parse.rss.RSSParser");
        Object o = (c.getConstructors()[0]).newInstance();
        c.getMethod("testMain").invoke(o); //this works fine!
        //org.jdom.Document d = new org.jdom.Document();
    }
}
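
The test program above builds a URLClassLoader whose parent is the loader that loaded test2. When the same class loads here but not inside Nutch, one thing that may be worth comparing is the chain of parent loaders in each environment, since a class is only visible to a loader if that loader or one of its ancestors can find it. The following is a minimal diagnostic sketch under that assumption. `LoaderChainSketch` and `chain` are hypothetical names and are not part of Nutch; the sketch uses only standard JDK calls.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.util.ArrayList;
import java.util.List;

public class LoaderChainSketch {

    /**
     * Walks from the given loader up through its parents and collects each one.
     * The walk stops at null, which is how the JDK represents the bootstrap loader.
     */
    static List<ClassLoader> chain(ClassLoader loader) {
        List<ClassLoader> result = new ArrayList<>();
        for (ClassLoader cl = loader; cl != null; cl = cl.getParent()) {
            result.add(cl);
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        ClassLoader parent = LoaderChainSketch.class.getClassLoader();
        // Empty URL array: this sketch only inspects the chain, it adds no jars.
        try (URLClassLoader child = new URLClassLoader(new URL[0], parent)) {
            for (ClassLoader cl : chain(child)) {
                System.out.println(cl);
                // For URL-based loaders, also list the locations they search.
                if (cl instanceof URLClassLoader) {
                    for (URL u : ((URLClassLoader) cl).getURLs()) {
                        System.out.println("  " + u);
                    }
                }
            }
        }
    }
}
```

Calling `chain(...)` on the loader used in test2 and on the loader the plugin system hands to the parser, then comparing the two listings, would show whether both see the same jars and the same parents.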

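
To illustrate the "export" lookup described in the message, here is a small sketch of how a misspelled child element name silently changes which libraries count as exported. It is not the Nutch code: it swaps in the JDK's built-in DOM parser instead of the `element(...)`/`attributeValue(...)` API shown in the quoted method, and the manifest fragment, `ExportLookupSketch`, and `librariesWithChild` are hypothetical names made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class ExportLookupSketch {

    // Hypothetical manifest fragment; the library entries are illustrative only.
    static final String MANIFEST =
        "<plugin id=\"parse-rss\">"
      + "<runtime>"
      + "<library name=\"parse-rss.jar\"><export name=\"*\"/></library>"
      + "<library name=\"log4j-1.2.6.jar\"/>"
      + "</runtime>"
      + "</plugin>";

    /** Returns the names of all library elements that contain an element named childName. */
    static List<String> librariesWithChild(String xml, String childName) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        NodeList libraries = doc.getElementsByTagName("library");
        List<String> names = new ArrayList<>();
        for (int i = 0; i < libraries.getLength(); i++) {
            Element library = (Element) libraries.item(i);
            if (library.getElementsByTagName(childName).getLength() > 0) {
                names.add(library.getAttribute("name"));
            }
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        // Correct element name: the library carrying <export> is found.
        System.out.println("export:  " + librariesWithChild(MANIFEST, "export"));
        // Misspelled element name: nothing matches, so no library looks exported.
        System.out.println("extport: " + librariesWithChild(MANIFEST, "extport"));
    }
}
```

With the misspelled name the lookup raises no error; it simply never matches, which is why a typo of this kind can go unnoticed.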