Hi Andrzej,
Yeah, actually I was the one that initiated that thread about the XML parsing libraries ;) Kinda funny how my plugin uses one huh? :-)
:-)
The plugin I submitted uses jdom actually (although it's a moot point
[..]
Ah, thanks for the clarification. So, what this boils down to is that dom4j is there to stay. I don't mind it, I was just curious.
As for the patch having large white-space in the diffs, I can fix that with a perl script. I'll try and fix that by tonight.
Perhaps try to run a 'svn diff -x b' to ignore whitespace changes, and then only fix the remaining lines that really differ.
With respect to the transformDocument commment, my RSS Parser doesn't use that function: that is from the one that Stefan submitted earlier before he could find my code and look at it. The two files that I submitted (that comprise my plugin) are:
parse-rss-patch.txt
parse-rss.zip
Ah, ok - then I was looking at the wrong files altogether. I now had a look at the source in parse-rss.zip. If you don't mind, here's a couple of new comments:
* package names follow the old naming, the new naming is under org.apache.nutch.*
* in RSSParser.java, you retrieve contentLength, but the code never uses it.
* lines 149-160 seem a bit bogus to me. As I understand the RSS spec, the item's permalink should be preferred _if present_, but it's not an error if it's absent (as signified by a null value, which currently causes MalformedURLException to be thrown) - in such case the getLink should be used instead. The message in line 157 is wrong, too, because it prints the url of the channel, and not the current item. When it's fixed, it would be also good to demonstrate such fallback in the test case.
* I'm not sure what is the purpose of copying through the metadata - the code doesn't modify the copy, so you could as well use the original, right?
* probably just a matter of programming style, but I'm always somewhat vexed by frequent String concatenations, especially in a "for" loop - like in the code that creates the title and the body. StringBuffer-s would be a good fit here...
* IMHO it's better to put the various intermediate diagnostic output from the plugin under LOG.fine(), to reduce the amount of information to be logged. The final result of processing the content could be put under LOG.info() or LOG.warn(), depending on the final result. (I personally favor no output whatsoever if everything went ok).
And lastly, a minor thing, but still... the formatting style and indentation in most files doesn't adhere to the Nutch coding style when it comes to whitespace rules - please see e.g. WebDBReader.java as a reference. This especially concerns the whitespace around curly braces and assignments, and the use of literal Tab instead of 4 spaces. This is easy to fix with an IDE, but it helps a lot when someone else is reading the code...
Thanks again for this contribution!
-- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __________________________________ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
