Hi Andrzej, Thanks for all the comments. I will update the parse-rss code to abide by those guidelines, and will try to get out an update in the next day or so.
Thanks for looking at the code so carefully! I'll also fix the patch file using your suggested svn command. Take care, Chris ______________________________________________ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _________________________________________________ Jet Propulsion Laboratory Pasadena, CA Office: 171-266B Mailstop: 171-246 Phone: 818-354-8810 _______________________________________________________ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology. > -----Original Message----- > From: Andrzej Bialecki [mailto:[EMAIL PROTECTED] > Sent: Monday, April 04, 2005 3:17 PM > To: [email protected] > Subject: Re: RSS Parser Plugin based on commons-feedparser submitted > > Chris Mattmann wrote: > > Hi Andrzej, > > > > Yeah, actually I was the one that initiated that thread about the XML > > parsing libraries ;) Kinda funny how my plugin uses one huh? :-) > > > > :-) > > > The plugin I submitted uses jdom actually (although it's a moot point > [..] > > Ah, thanks for the clarification. So, what this boils down to is that > dom4j is there to stay. I don't mind it, I was just curious. > > > > > As for the patch having large white-space in the diffs, I can fix that > with > > a perl script. I'll try and fix that by tonight. > > > > Perhaps try to run a 'svn diff -x b' to ignore whitespace changes, and > then only fix the remaining lines that really differ. > > > With respect to the transformDocument commment, my RSS Parser doesn't > use > > that function: that is from the one that Stefan submitted earlier before > he > > could find my code and look at it. The two files that I submitted (that > > comprise my plugin) are: > > > > parse-rss-patch.txt > > parse-rss.zip > > Ah, ok - then I was looking at the wrong files altogether. I now had a > look at the source in parse-rss.zip. If you don't mind, here's a couple > of new comments: > > * package names follow the old naming, the new naming is under > org.apache.nutch.* > > * in RSSParser.java, you retrieve contentLength, but the code never uses > it. > > * lines 149-160 seem a bit bogus to me. As I understand the RSS spec, > the item's permalink should be preferred _if present_, but it's not an > error if it's absent (as signified by a null value, which currently > causes MalformedURLException to be thrown) - in such case the getLink > should be used instead. The message in line 157 is wrong, too, because > it prints the url of the channel, and not the current item. When it's > fixed, it would be also good to demonstrate such fallback in the test > case. > > * I'm not sure what is the purpose of copying through the metadata - the > code doesn't modify the copy, so you could as well use the original, > right? > > * probably just a matter of programming style, but I'm always somewhat > vexed by frequent String concatenations, especially in a "for" loop - > like in the code that creates the title and the body. StringBuffer-s > would be a good fit here... > > * IMHO it's better to put the various intermediate diagnostic output > from the plugin under LOG.fine(), to reduce the amount of information to > be logged. The final result of processing the content could be put under > LOG.info() or LOG.warn(), depending on the final result. (I personally > favor no output whatsoever if everything went ok). > > And lastly, a minor thing, but still... the formatting style and > indentation in most files doesn't adhere to the Nutch coding style when > it comes to whitespace rules - please see e.g. WebDBReader.java as a > reference. This especially concerns the whitespace around curly braces > and assignments, and the use of literal Tab instead of 4 spaces. This is > easy to fix with an IDE, but it helps a lot when someone else is reading > the code... > > Thanks again for this contribution! > > > -- > Best regards, > Andrzej Bialecki > ___. ___ ___ ___ _ _ __________________________________ > [__ || __|__/|__||\/| Information Retrieval, Semantic Web > ___|||__|| \| || | Embedded Unix, System Integration > http://www.sigram.com Contact: info at sigram dot com
