[jira] Updated: (NUTCH-30) rss feed parser

Chris A. Mattmann (JIRA) Wed, 06 Apr 2005 14:06:35 -0700

     [ http://issues.apache.org/jira/browse/NUTCH-30?page=history ]


Chris A. Mattmann updated NUTCH-30:
-----------------------------------

    Attachment: parse-rss-1.0-040605.zip

Hi Folks,

 Okay, I've addressed the comments put forth by Andrzej Bialecki regarding the 
parse-rss plugin that I submitted, and the latest version of it is attached. 
Included in the zip file is basically the directory for the plugin, which is 
self-contained. I haven't attached the svn diff/patch file with this, but here 
are the following things that would need to be done to include this in the 
source tree:

1. enable/add the plugin in nutch-default.xml (the name of the plugin is 
parse-rss)
2. add the plugin parse-rss in src/plugin/build.xml in the "deploy", "test" and 
"clean" targets
3. edit conf/mime.types to add the following two lines in the file:

application/rss+xml             rss
text/xml                        rss

4. That's it.

I can attach an SVN diff of this from the latest nutch snapshot later on 
tonight, however just to note, this attachment includes the following updates:

> * package names follow the old naming, the new naming is under
> org.apache.nutch.*

Fixed, tested, and included in the zip file 
> 
> * in RSSParser.java, you retrieve contentLength, but the code never 
> uses it.

Took this part out:  included in the zip file.

> 
> * lines 149-160 seem a bit bogus to me. As I understand the RSS spec, 
> the item's permalink should be preferred _if present_, but it's not an 
> error if it's absent (as signified by a null value, which currently 
> causes MalformedURLException to be thrown) - in such case the getLink 
> should be used instead. The message in line 157 is wrong, too, because 
> it prints the url of the channel, and not the current item. When it's 
> fixed, it would be also good to demonstrate such fallback in the test 
> case.

Fixed this part: now, I first check for the perm link, if it's null, I try the 
regular "link" field, and if there's nothing there, then I make it choke (i.e. 
throw MalformedURLException). I hope it's a bit more robust

> 
> * I'm not sure what is the purpose of copying through the metadata - 
> the code doesn't modify the copy, so you could as well use the 
> original, right?

took out

> 
> * probably just a matter of programming style, but I'm always somewhat 
> vexed by frequent String concatenations, especially in a "for" loop - 
> like in the code that creates the title and the body. StringBuffer-s 
> would be a good fit here...

it now uses StringBuffer.append instead of concats


> 
> * IMHO it's better to put the various intermediate diagnostic output 
> from the plugin under LOG.fine(), to reduce the amount of information 
> to be logged. The final result of processing the content could be put 
> under
> LOG.info() or LOG.warn(), depending on the final result. (I personally 
> favor no output whatsoever if everything went ok).

Yup, I agree. I've put all not necessary (i.e. non exception) logging to go to 
LOG.fine, and included in the zip file

> 
> And lastly, a minor thing, but still... the formatting style and 
> indentation in most files doesn't adhere to the Nutch coding style 
> when it comes to whitespace rules - please see e.g. WebDBReader.java 
> as a reference. This especially concerns the whitespace around curly 
> braces and assignments, and the use of literal Tab instead of 4 
> spaces. This is easy to fix with an IDE, but it helps a lot when 
> someone else is reading the code...

Fixed.





> rss feed parser
> ---------------
>
>          Key: NUTCH-30
>          URL: http://issues.apache.org/jira/browse/NUTCH-30
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Reporter: Stefan Grroschupf
>     Priority: Minor
>  Attachments: RSSParserPatch.txt, RSS_Parser.zip, parse-rss-1.0-040605.zip, 
> parse-rss-patch.txt, parse-rss.zip
>
> A simple rss feed parser supporting:
> rss and atom:
> + version 0.3
> +  version 09
> + version 10
> + version 20
> Converting of different rss versions  is done via xslt. 
> The xslt was contributed by Frank Henze - Thanks!

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
If you want more information on JIRA, or have a bug to report see:
   http://www.atlassian.com/software/jira

[jira] Updated: (NUTCH-30) rss feed parser

Reply via email to