Hi Andrzej,

Yeah, actually I was the one who initiated that thread about the XML
parsing libraries ;) Kinda funny how my plugin uses one, huh? :-)

The plugin I submitted uses jdom actually (although it's a moot point
whether it uses jdom, or dom4j, etc.). The jdom dependency comes from jaxen,
which the commons-feedparser uses. The nice thing about the
commons-feedparser component is its SAX-based (event style) parsing model,
and its ability to handle virtually all of the different RSS feed styles
(Atom, RSS 1.0, 2.0, etc.).
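
For readers unfamiliar with event-style parsing, here is a minimal sketch of
the idea. It assumes the JDK's built-in SAX API rather than
commons-feedparser's own interfaces, and the class name FeedLinkSketch, the
method extractLinks, and the sample feed are made up for illustration; it is
not code from the parse-rss plugin.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only: plain JDK SAX, not the commons-feedparser
// API and not the parse-rss plugin's actual code.
public class FeedLinkSketch {

    // Collects the text of every <link> element nested inside an <item>.
    public static List<String> extractLinks(String xml) throws Exception {
        final List<String> links = new ArrayList<>();

        DefaultHandler handler = new DefaultHandler() {
            private boolean inItem = false;
            // Non-null only while we are inside an <item>'s <link>.
            private StringBuilder text = null;

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if ("item".equals(qName)) {
                    inItem = true;
                } else if (inItem && "link".equals(qName)) {
                    text = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // The parser may deliver text in several chunks.
                if (text != null) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName,
                                   String qName) {
                if ("link".equals(qName) && text != null) {
                    links.add(text.toString().trim());
                    text = null;
                } else if ("item".equals(qName)) {
                    inItem = false;
                }
            }
        };

        // The default factory is not namespace-aware, so qName carries
        // the plain element name.
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return links;
    }

    public static void main(String[] args) throws Exception {
        String feed = "<rss version=\"2.0\"><channel>"
            + "<link>http://example.org/</link>"
            + "<item><link>http://example.org/a</link></item>"
            + "<item><link>http://example.org/b</link></item>"
            + "</channel></rss>";
        System.out.println(extractLinks(feed));
    }
}
```

The handler never builds a document tree: it reacts to start, text, and end
events as the parser streams through the feed, which is what keeps this
style light on memory for large feeds.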

The original discussion about the different XML parsing APIs arose out of
Nutch's reliance (at the time) on dom4j 1.4.2, which bundled some external
jaxen API classes. Those bundled classes caused namespace conflicts with
various other XML parsing APIs. As a result, anyone who wrote a Nutch plugin
before the dom4j in the $NUTCH_HOME/lib directory was upgraded to 1.5.2, and
who needed jaxen, dom4j, or other XML reading APIs in that plugin, would have
hit namespace conflicts, as I did. So Doug upgraded Nutch to rely on dom4j
1.5.2, which doesn't include the additional jaxen classes, and that problem
has been alleviated (for now of course, until the next XML API conflict comes
along ;) ).

As for the large whitespace changes in the patch's diffs, I can fix those
with a perl script. I'll try to fix that by tonight.

With respect to the transformDocument comment, my RSS parser doesn't use
that function: it comes from the parser that Stefan submitted earlier, before
he could find my code and look at it. The two files that I submitted (which
comprise my plugin) are:

 parse-rss-patch.txt
 parse-rss.zip 


Thanks for your comments and I hope that the Nutch community can benefit
from the plugin.


Cheers,
  Chris


On 4/4/05 12:14 PM, "Andrzej Bialecki" <[EMAIL PROTECTED]> wrote:

> Chris Mattmann wrote:
>> Hi Folks,
>> 
>>  I just wanted to let you know that I've submitted the parse-rss plugin that
>> I was working on to the JIRA system under issue "NUTCH-30"
>> (http://issues.apache.org/jira/browse/NUTCH-30). The plugin includes a patch
>> file (svn diff), along with the zipped up source and runtime libraries. The
>> rss parser is based on the commons-feedparser out of the jakarta sandbox,
>> and fully supports all of the major rss formats (atom, rss 1.0, 2.0, etc.).
>> Additionally, I've included a junit test that runs the parser on an example
>> rss file and validates the outlinks and content extracted.
>> 
>> I hope that you will find it useful and vote to have it included in the
>> nutch distro.
> 
> +1, with some reservations (see jira).
> 
> I think it's a very useful contribution. Thank you, Chris!

______________________________________________
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group
 
_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                        Mailstop:  171-246
Phone:  818-354-8810
_______________________________________________________
 
Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.
 
 


