Hi Andrzej,

Yeah, actually I was the one who initiated that thread about the XML parsing libraries ;) Kind of funny that my plugin uses one, huh? :-)

The plugin I submitted actually uses jdom (although whether it uses jdom, dom4j, etc. is a moot point). The jdom dependency comes from jaxen, which the commons-feedparser uses. The nice thing about the commons-feedparser component is its SAX-based (event style) parsing model, and its ability to handle virtually all of the different RSS feed styles (Atom, RSS 1.0, 2.0, etc.).

The original discussion about the different XML parsing APIs arose out of Nutch's reliance (at the time) on dom4j 1.4.2, which bundled some external jaxen API classes, and those caused namespace conflicts with various other XML parsing APIs. So anyone who wrote a Nutch plugin before the dom4j in the $NUTCH_HOME/lib directory was upgraded to 1.5.2, and who needed jaxen, dom4j, or another XML reading API in that plugin, would have hit the same namespace conflicts I did. Doug then upgraded Nutch to rely on dom4j 1.5.2, which doesn't include the additional jaxen classes, and that problem has been alleviated (for now of course, until the next XML API conflict comes along ;) ).

As for the patch having large white-space in the diffs, I can fix that with a perl script. I'll try to fix that by tonight.

With respect to the transformDocument comment, my RSS Parser doesn't use that function: that is from the one that Stefan submitted earlier, before he could find my code and look at it.

The two files that I submitted (that comprise my plugin) are:

parse-rss-patch.txt
parse-rss.zip

Thanks for your comments and I hope that the Nutch community can benefit from the plugin.

Cheers,
  Chris

On 4/4/05 12:14 PM, "Andrzej Bialecki" <[EMAIL PROTECTED]> wrote:

> Chris Mattmann wrote:
>> Hi Folks,
>>
>> I just wanted to let you know that I've submitted the parse-rss plugin that
>> I was working on to the JIRA system under issue "NUTCH-30"
>> (http://issues.apache.org/jira/browse/NUTCH-30). The plugin includes a patch
>> file (svn diff), along with the zipped up source and runtime libraries.
>> The rss parser is based on the commons-feedparser out of the jakarta sandbox,
>> and fully supports all of the major rss formats (atom, rss 1.0, 2.0, etc.).
>> Additionally, I've included a junit test that runs the parser on an example
>> rss file and validates the outlinks and content extracted.
>>
>> I hope that you will find it useful and vote to have it included in the
>> nutch distro.
>
> +1, with some reservations (see jira).
>
> I think it's a very useful contribution. Thank you, Chris!

______________________________________________
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                     Mailstop: 171-246
Phone: 818-354-8810
_______________________________________________________

Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
