[ http://issues.apache.org/jira/browse/NUTCH-30?page=comments#action_62133 ]

Andrzej Bialecki commented on NUTCH-30:
----------------------------------------

A couple of comments:

* there was a recent thread on the list discussing a motion to reduce dependencies on external XML-related libraries, and instead to rely on the JDK-supplied XML parser. I noticed that your plugin uses dom4j. I'm not sure how easy/feasible/sensible it would be to get rid of that dependency...?

* the parse-rss-patch.txt patch introduces huge white-space diffs. This is confusing, and they should be removed before the patch is applied.

* the method transformDocument creates a new Transformer for every call. Transformers are not thread-safe, so we cannot share a single instance across threads, but each transformation can be a CPU- and time-intensive process. Perhaps this should be converted to use a pool of pre-allocated Transformers, as many as the fetcher.threads configuration variable specifies.

> rss feed parser
> ---------------
>
>          Key: NUTCH-30
>          URL: http://issues.apache.org/jira/browse/NUTCH-30
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Reporter: Stefan Grroschupf
>     Priority: Minor
>  Attachments: RSSParserPatch.txt, RSS_Parser.zip, parse-rss-patch.txt, parse-rss.zip
>
> A simple rss feed parser supporting:
> rss and atom:
> + version 0.3
> + version 09
> + version 10
> + version 20
> Converting of different rss versions is done via xslt.
> The xslt was contributed by Frank Henze - Thanks!
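
To illustrate the Transformer pool suggested in the comment above, here is a minimal sketch using only the JDK's javax.xml.transform API. The class name TransformerPool, its constructor arguments, and the string-in/string-out transform method are hypothetical and not taken from the patch; the pool size is passed in directly rather than read from fetcher.threads, and the generics syntax assumes a current JDK.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper, not part of the parse-rss patch.
final class TransformerPool {
    // Each Transformer is used by at most one thread at a time.
    private final BlockingQueue<Transformer> pool;

    // 'size' would typically come from the fetcher.threads setting (assumption).
    TransformerPool(String xslt, int size) throws TransformerException {
        // Compile the stylesheet once; Templates objects are thread-safe.
        Templates templates = TransformerFactory.newInstance()
                .newTemplates(new StreamSource(new StringReader(xslt)));
        pool = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            pool.add(templates.newTransformer());
        }
    }

    // Borrows a Transformer, applies it to 'xml', and returns it to the pool.
    String transform(String xml) throws TransformerException, InterruptedException {
        Transformer t = pool.take(); // blocks while all instances are in use
        try {
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } finally {
            pool.add(t); // always hand the instance back, even on failure
        }
    }
}
```

Since the compiled Templates object is itself thread-safe, a simpler alternative would be to share it and call newTransformer() per request, which avoids recompiling the stylesheet without needing a pool at all; which variant is cheaper would have to be measured.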
