[ http://issues.apache.org/jira/browse/NUTCH-30?page=history ]

Chris A. Mattmann updated NUTCH-30:
-----------------------------------

    Attachment: parse-rss.zip
                parse-rss-patch.txt

This is an RSS parser plugin based on the Apache Commons FeedParser RSS 
parsing library. From the FeedParser site 
(http://jakarta.apache.org/commons/sandbox/feedparser/):

Jakarta FeedParser is a Java RSS/Atom parser designed to elegantly support all 
versions of RSS (0.9, 0.91, 0.92, 1.0, and 2.0), Atom 0.5 (and future versions) 
as well as easy ad hoc extension and RSS 1.0 modules capability. 

FeedParser was the parser API designed by Kevin Burton for NewsMonster and has 
been donated to the ASF in order to continue development. 

FeedParser differs from most other RSS/Atom parsers in that it is not DOM based 
but event based (similar to SAX). Instead of the low level startElement() API 
present in SAX, we provide higher level events based on feed parsing 
information. 
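
To illustrate the difference between raw SAX callbacks and higher-level feed 
events, here is a minimal sketch built only on the JDK's SAX parser. The names 
FeedEventSketch, ItemListener, ItemHandler and itemSummaries are hypothetical 
and chosen for this example; this is not the FeedParser API, and it only 
handles plain RSS 2.0 style <item> elements.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class FeedEventSketch {

    /** Hypothetical high-level listener; NOT the FeedParser interface. */
    public interface ItemListener {
        void onItem(String title, String link);
    }

    /** Turns low-level SAX callbacks into one onItem() event per <item>. */
    static class ItemHandler extends DefaultHandler {
        private final ItemListener listener;
        private final StringBuilder text = new StringBuilder();
        private boolean inItem;
        private String title;
        private String link;

        ItemHandler(ItemListener listener) {
            this.listener = listener;
        }

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            text.setLength(0); // start collecting text for the new element
            if (qName.equals("item")) {
                inItem = true;
                title = null;
                link = null;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            if (!inItem) {
                return; // ignore channel-level elements in this sketch
            }
            if (qName.equals("title")) {
                title = text.toString().trim();
            } else if (qName.equals("link")) {
                link = text.toString().trim();
            } else if (qName.equals("item")) {
                inItem = false;
                listener.onItem(title, link); // the high-level event
            }
        }
    }

    /** Parses the XML and returns one "title -> link" string per item. */
    public static List<String> itemSummaries(String xml) {
        List<String> out = new ArrayList<>();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new InputSource(new StringReader(xml)),
                new ItemHandler((title, link) -> out.add(title + " -> " + link)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse feed", e);
        }
        return out;
    }
}
```

The caller only implements onItem() and never sees startElement(); a real 
feed library would add further events (channel, dates, content) in the same 
way.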

Events are also given to the caller independent of the underlying format. This 
is accomplished with a Feed Event Model that isolates your application from the 
underlying feed format. This enables transparent support for all RSS versions 
including Atom. We also hide format specific implementation such as dates (RFC 
822 in RSS 2.0 and 0.9x and ISO 8601 in RSS 1.0 and Atom) and other metadata. 
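
As a rough sketch of what hiding the date format can look like, the snippet 
below maps both date styles onto a single java.time.Instant. The class 
FeedDates and its toInstant method are hypothetical names for this example and 
are not part of FeedParser. It assumes four-digit years and numeric or GMT 
zones; older RFC 822 variants (two-digit years, zone names such as EST) would 
need extra handling.

```java
import java.time.Instant;
import java.time.OffsetDateTime;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

public class FeedDates {

    /**
     * Hypothetical helper: tries the RFC 822/1123 style used by RSS 2.0
     * first, then falls back to ISO 8601 as used by RSS 1.0 and Atom.
     */
    public static Instant toInstant(String raw) {
        String s = raw.trim();
        try {
            return ZonedDateTime.parse(s, DateTimeFormatter.RFC_1123_DATE_TIME).toInstant();
        } catch (DateTimeParseException e) {
            // Not RFC 822 style; expect an ISO 8601 date-time with an offset.
            return OffsetDateTime.parse(s).toInstant();
        }
    }
}
```

With this, "Tue, 3 Jun 2008 11:05:30 GMT" and "2008-06-03T11:05:30Z" yield 
the same Instant, so the calling code does not need to know which feed format 
the date came from.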

The FeedParser distribution also includes: 

1. An implementation of RSS and Atom autodiscovery. 
2. Support for all content modules including xhtml:body, mod_content (RDF and 
inline), atom:content, and atom:summary 
3. Atom 1.0 link API as well as RSS 1.0 mod_link API 
4. An HTML link parser for finding all links in an HTML source file and 
expanding them to become full URLs instead of relative. 
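
The link expansion mentioned in item 4 can be sketched with the JDK alone. 
LinkExpander and expand are hypothetical names for this example, not 
FeedParser classes, and the sketch assumes the href has already been extracted 
from the HTML and is a syntactically valid URI reference.

```java
import java.net.URI;

public class LinkExpander {

    /**
     * Hypothetical helper: resolves an href found in a page against the
     * URL of that page, so relative links become full URLs.
     */
    public static String expand(String baseUrl, String href) {
        return URI.create(baseUrl).resolve(href.trim()).toString();
    }
}
```

For example, expand("http://example.org/blog/index.html", "../feed.rss") 
gives "http://example.org/feed.rss", while an href that is already absolute 
is returned unchanged.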

I've attached the zipped-up parse-rss Nutch plugin, along with a patch file 
generated from svn diff. Hopefully you will find the parser useful; please 
vote for it if you would like it included in the Nutch source tree.

Cheers,
  Chris




> rss feed parser
> ---------------
>
>          Key: NUTCH-30
>          URL: http://issues.apache.org/jira/browse/NUTCH-30
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Reporter: Stefan Groschupf
>     Priority: Minor
>  Attachments: RSSParserPatch.txt, RSS_Parser.zip, parse-rss-patch.txt, 
> parse-rss.zip
>
> A simple RSS feed parser supporting
> RSS and Atom:
> + version 0.3
> + version 0.9
> + version 1.0
> + version 2.0
> Conversion between the different RSS versions is done via XSLT. 
> The XSLT was contributed by Frank Henze - Thanks!

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
If you want more information on JIRA, or have a bug to report see:
   http://www.atlassian.com/software/jira
