[ http://issues.apache.org/jira/browse/NUTCH-30?page=comments#action_62133 
]
     
Andrzej Bialecki  commented on NUTCH-30:
----------------------------------------

A couple of comments:

* there was a recent thread on the list about reducing dependencies on 
external XML-related libraries and relying on the JDK-supplied XML parser 
instead. I noticed that your plugin uses dom4j. I'm not sure how 
easy/feasible/sensible it would be to get rid of that dependency...?

* the parse-rss-patch.txt patch introduces huge white-space diffs. These are 
confusing and should be removed before the patch is applied.

* the method transformDocument creates a new Transformer for every call. 
Transformers are not thread-safe, so we cannot share a single instance 
between threads, but each transformation can be CPU- and time-intensive. 
Perhaps this should be converted to use a pool of pre-allocated Transformers, 
as many as the fetcher.threads conf. variable specifies.
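
To make the pooling suggestion concrete, here is a rough sketch of what I 
have in mind. It is not taken from the patch: the class name TransformerPool 
and its methods are made up for illustration, and it assumes the stylesheet 
can be compiled once into a javax.xml.transform.Templates object and that 
java.util.concurrent is available. The pool size would come from the 
fetcher.threads setting.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import javax.xml.transform.Source;
import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical fixed-size pool of Transformers; not part of the plugin. */
class TransformerPool {
    private final BlockingQueue<Transformer> pool;

    TransformerPool(Source stylesheet, int size) throws TransformerException {
        // Compile the stylesheet once. Templates objects may be shared
        // between threads, unlike the Transformers created from them.
        Templates templates =
            TransformerFactory.newInstance().newTemplates(stylesheet);
        pool = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            pool.add(templates.newTransformer());
        }
    }

    /** Borrows a Transformer, runs it on the input, and returns it. */
    String transform(String xml)
            throws TransformerException, InterruptedException {
        Transformer t = pool.take(); // blocks while all instances are busy
        try {
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)),
                        new StreamResult(out));
            return out.toString();
        } finally {
            t.clearParameters(); // drop per-call parameters before reuse
            pool.offer(t);       // never fails: capacity equals pool size
        }
    }
}
```

Each fetcher thread would call transform() on one shared pool, so no 
Transformer is ever used by two threads at once and none is created after 
startup.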

> rss feed parser
> ---------------
>
>          Key: NUTCH-30
>          URL: http://issues.apache.org/jira/browse/NUTCH-30
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Reporter: Stefan Groschupf
>     Priority: Minor
>  Attachments: RSSParserPatch.txt, RSS_Parser.zip, parse-rss-patch.txt, 
> parse-rss.zip
>
> A simple RSS feed parser supporting RSS and Atom:
> + version 0.3
> + version 0.9
> + version 1.0
> + version 2.0
> Conversion between the different RSS versions is done via XSLT.
> The XSLT was contributed by Frank Henze - Thanks!

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
If you want more information on JIRA, or have a bug to report see:
   http://www.atlassian.com/software/jira