[ http://issues.apache.org/jira/browse/NUTCH-30?page=history ]
Chris A. Mattmann updated NUTCH-30:
-----------------------------------
Attachment: parse-rss-1.0-040605.zip
Hi Folks,
Okay, I've addressed the comments put forth by Andrzej Bialecki regarding the
parse-rss plugin that I submitted, and the latest version of it is attached.
Included in the zip file is basically the directory for the plugin, which is
self-contained. I haven't attached the svn diff/patch file with this, but here
are the following things that would need to be done to include this in the
source tree:
1. enable/add the plugin in nutch-default.xml (the name of the plugin is
parse-rss)
2. add the plugin parse-rss in src/plugin/build.xml in the "deploy", "test" and
"clean" targets
3. edit conf/mime.types to add the following two lines in the file:
application/rss+xml rss
text/xml rss
4. That's it.
I can attach an SVN diff of this from the latest nutch snapshot later on
tonight, however just to note, this attachment includes the following updates:
> * package names follow the old naming, the new naming is under
> org.apache.nutch.*
Fixed, tested, and included in the zip file
>
> * in RSSParser.java, you retrieve contentLength, but the code never
> uses it.
Took this part out: included in the zip file.
>
> * lines 149-160 seem a bit bogus to me. As I understand the RSS spec,
> the item's permalink should be preferred _if present_, but it's not an
> error if it's absent (as signified by a null value, which currently
> causes MalformedURLException to be thrown) - in such case the getLink
> should be used instead. The message in line 157 is wrong, too, because
> it prints the url of the channel, and not the current item. When it's
> fixed, it would be also good to demonstrate such fallback in the test
> case.
Fixed this part: now, I first check for the perm link, if it's null, I try the
regular "link" field, and if there's nothing there, then I make it choke (i.e.
throw MalformedURLException). I hope it's a bit more robust
>
> * I'm not sure what is the purpose of copying through the metadata -
> the code doesn't modify the copy, so you could as well use the
> original, right?
took out
>
> * probably just a matter of programming style, but I'm always somewhat
> vexed by frequent String concatenations, especially in a "for" loop -
> like in the code that creates the title and the body. StringBuffer-s
> would be a good fit here...
it now uses StringBuffer.append instead of concats
>
> * IMHO it's better to put the various intermediate diagnostic output
> from the plugin under LOG.fine(), to reduce the amount of information
> to be logged. The final result of processing the content could be put
> under
> LOG.info() or LOG.warn(), depending on the final result. (I personally
> favor no output whatsoever if everything went ok).
Yup, I agree. I've put all not necessary (i.e. non exception) logging to go to
LOG.fine, and included in the zip file
>
> And lastly, a minor thing, but still... the formatting style and
> indentation in most files doesn't adhere to the Nutch coding style
> when it comes to whitespace rules - please see e.g. WebDBReader.java
> as a reference. This especially concerns the whitespace around curly
> braces and assignments, and the use of literal Tab instead of 4
> spaces. This is easy to fix with an IDE, but it helps a lot when
> someone else is reading the code...
Fixed.
> rss feed parser
> ---------------
>
> Key: NUTCH-30
> URL: http://issues.apache.org/jira/browse/NUTCH-30
> Project: Nutch
> Type: Improvement
> Components: fetcher
> Reporter: Stefan Grroschupf
> Priority: Minor
> Attachments: RSSParserPatch.txt, RSS_Parser.zip, parse-rss-1.0-040605.zip,
> parse-rss-patch.txt, parse-rss.zip
>
> A simple rss feed parser supporting:
> rss and atom:
> + version 0.3
> + version 09
> + version 10
> + version 20
> Converting of different rss versions is done via xslt.
> The xslt was contributed by Frank Henze - Thanks!
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
If you want more information on JIRA, or have a bug to report see:
http://www.atlassian.com/software/jira