[ http://issues.apache.org/jira/browse/NUTCH-30?page=history ]
Chris A. Mattmann updated NUTCH-30:
-----------------------------------
Attachment: parse-rss.zip
parse-rss-patch.txt
An RSS Parser plugin based on the Apache Commons-Feedparser RSS Parsing
library. From the feedparser site
(http://jakarta.apache.org/commons/sandbox/feedparser/):
Jakarta FeedParser is a Java RSS/Atom parser designed to elegantly support all
versions of RSS (0.9, 0.91, 0.92, 1.0, and 2.0), Atom 0.5 (and future versions)
as well as easy ad hoc extension and RSS 1.0 modules capability.
FeedParser was the parser API designed by Kevin Burton for NewsMonster and has
been donated to the ASF in order to continue development.
FeedParser differs from most other RSS/Atom parsers in that it is not DOM based
but event based (similar to SAX). Instead of the low level startElement() API
present in SAX, we provide higher level events based on feed parsing
information.
Events are also given to the caller independent of the underlying format. This
is accomplished with a Feed Event Model that isolates your application from the
underlying feed format. This enables transparent support for all RSS versions
including Atom. We also hide format specific implementation such as dates (RFC
822 in RSS 2.0 and 0.9x and ISO 8601 in RSS 1.0 and Atom) and other metadata.
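
To illustrate the event model described above, here is a minimal sketch of the idea: low-level SAX callbacks are adapted into one high-level "item" event that looks the same for RSS and Atom. It uses only the JDK's SAX classes. The names FeedEventSketch, FeedListener, and onItem are made up for this example and are not the actual FeedParser API.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class FeedEventSketch {

    /** Hypothetical high-level listener; NOT the real FeedParser interface. */
    interface FeedListener {
        void onItem(String title, String link);
    }

    /** Adapts low-level SAX callbacks into format-independent item events. */
    static class FeedHandler extends DefaultHandler {
        private final FeedListener listener;
        private final StringBuilder text = new StringBuilder();
        private boolean inEntry;
        private String title;
        private String link;

        FeedHandler(FeedListener listener) {
            this.listener = listener;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attrs) {
            text.setLength(0); // start collecting text for this element
            if (qName.equals("item") || qName.equals("entry")) {
                inEntry = true; // RSS uses <item>, Atom uses <entry>
                title = null;
                link = null;
            } else if (inEntry && qName.equals("link") && attrs.getValue("href") != null) {
                link = attrs.getValue("href"); // Atom keeps the URL in an attribute
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (!inEntry) {
                return; // ignore channel-level or feed-level elements
            }
            if (qName.equals("title")) {
                title = text.toString().trim();
            } else if (qName.equals("link") && link == null) {
                link = text.toString().trim(); // RSS keeps the URL as element text
            } else if (qName.equals("item") || qName.equals("entry")) {
                inEntry = false;
                listener.onItem(title, link); // one event, whatever the format
            }
        }
    }

    /** Parses the feed text and reports each item to the listener. */
    static void parse(String xml, FeedListener listener) throws Exception {
        SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                new FeedHandler(listener));
    }

    /** Convenience helper: collects "title -> link" strings for every item. */
    static List<String> collect(String xml) throws Exception {
        List<String> out = new ArrayList<>();
        parse(xml, (title, link) -> out.add(title + " -> " + link));
        return out;
    }
}
```

A caller only implements FeedListener and never needs to know whether the input was RSS or Atom. Date normalization and the other metadata mentioned above are left out of this sketch.
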
The FeedParser distribution also includes:
1. An implementation of RSS and Atom autodiscovery.
2. Support for all content modules including xhtml:body, mod_content (RDF and
inline), atom:content, and atom:summary
3. Atom 1.0 link API as well as RSS 1.0 mod_link API
4. An HTML link parser for finding all links in an HTML source file and
expanding them to become full URLs instead of relative.
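
Item 4 above (expanding relative links into full URLs) can be sketched with the JDK alone. This is only an illustration of the technique, not FeedParser's link parser: the class name LinkExpanderSketch and the method expandLinks are invented here, and a regular expression over href attributes stands in for real HTML parsing.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkExpanderSketch {

    // Simplifying assumption: links appear as quoted href="..." attributes.
    private static final Pattern HREF = Pattern.compile(
            "href\\s*=\\s*[\"']([^\"']*)[\"']", Pattern.CASE_INSENSITIVE);

    /** Returns every href found in the HTML, resolved against baseUrl. */
    static List<String> expandLinks(String baseUrl, String html) {
        URI base = URI.create(baseUrl);
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            try {
                // URI.resolve turns relative references into absolute URLs
                // and leaves already-absolute URLs unchanged.
                links.add(base.resolve(m.group(1)).toString());
            } catch (IllegalArgumentException e) {
                // Skip hrefs that are not valid URI references.
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"story.html\">one</a> <a href='/feeds/rss.xml'>two</a>";
        for (String link : expandLinks("http://example.com/news/index.html", html)) {
            System.out.println(link);
        }
    }
}
```
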
I've attached the zipped parse-rss Nutch plugin, along with a patch file
generated with svn diff. Hopefully you will find the parser useful; please
vote for it if you would like it included in the Nutch source tree.
Cheers,
Chris
> rss feed parser
> ---------------
>
> Key: NUTCH-30
> URL: http://issues.apache.org/jira/browse/NUTCH-30
> Project: Nutch
> Type: Improvement
> Components: fetcher
> Reporter: Stefan Groschupf
> Priority: Minor
> Attachments: RSSParserPatch.txt, RSS_Parser.zip, parse-rss-patch.txt,
> parse-rss.zip
>
> A simple RSS feed parser supporting RSS and Atom:
> + version 0.3
> + version 0.9
> + version 1.0
> + version 2.0
> Conversion between the different RSS versions is done via XSLT.
> The xslt was contributed by Frank Henze - Thanks!
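
For readers unfamiliar with the XSLT approach mentioned in the original issue, the following is a rough sketch of converting one feed format into another with the JDK's javax.xml.transform API. The stylesheet is a toy written for this example (it is not the one contributed by Frank Henze), and the class and method names are hypothetical. It matches elements by local name so that it does not depend on any particular Atom namespace.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class XsltFeedConversionSketch {

    // Toy stylesheet: rewrites Atom-style <entry> elements as RSS 2.0 <item>s.
    private static final String ATOM_TO_RSS =
        "<xsl:stylesheet version=\"1.0\""
      + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>"
      + "<xsl:template match=\"/\">"
      + "<rss version=\"2.0\"><channel>"
      + "<xsl:for-each select=\"//*[local-name()='entry']\">"
      + "<item>"
      + "<title><xsl:value-of select=\"*[local-name()='title']\"/></title>"
      + "<link><xsl:value-of select=\"*[local-name()='link']/@href\"/></link>"
      + "</item>"
      + "</xsl:for-each>"
      + "</channel></rss>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    /** Applies the toy stylesheet to the given feed text. */
    static String toRss(String atomXml) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(ATOM_TO_RSS)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(atomXml)),
                new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String atom = "<feed><entry><title>Hello</title>"
                + "<link href=\"http://example.com/hello\"/></entry></feed>";
        System.out.println(toRss(atom));
    }
}
```

With one stylesheet per input format, the downstream parsing code only has to understand a single target format.
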
--
This message is automatically generated by JIRA.