[ 
https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12472596
 ] 

nutch.newbie commented on NUTCH-444:
------------------------------------

Hi Dogacan:

I did some digging around Rome yesterday, and it seems to me that Rome 
treats RSS differently than Atom: authors.getName behaves differently for an 
RSS feed than for an Atom feed, and the same goes for description, content and 
category, i.e. some values are returned null and some throw an exception. 
Could this be a cause of the problem? All the CNN RSS feeds passed with flying 
colors: http://www.cnn.com/services/rss/
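
To make the null-versus-exception point concrete, here is a minimal sketch of a defensive wrapper that could sit around such getter calls. It uses only the Java standard library; the class name SafeFeedFields and the helper safe are hypothetical and are not part of Rome or Nutch. It assumes the failures are unchecked exceptions, which I have not verified for every Rome getter.

```java
import java.util.function.Supplier;

// Hypothetical helper, not part of Rome or Nutch.
final class SafeFeedFields {
    private SafeFeedFields() {}

    /**
     * Calls the getter and returns its value, or the fallback when the
     * getter returns null or throws an unchecked exception.
     */
    static <T> T safe(Supplier<T> getter, T fallback) {
        try {
            T value = getter.get();
            return value != null ? value : fallback;
        } catch (RuntimeException e) {
            // Assumption: a missing field surfaces as an unchecked exception.
            return fallback;
        }
    }

    public static void main(String[] args) {
        String missing = null; // stands in for a field the feed does not carry
        System.out.println(safe(() -> "Jane Doe", "unknown"));     // value present
        System.out.println(safe(() -> missing, "unknown"));        // getter returns null
        System.out.println(safe(() -> missing.trim(), "unknown")); // getter throws
    }
}
```

With something like this, the parser could read author, description, content and category the same way for RSS and Atom entries and fall back to an empty value instead of failing the whole feed.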

Brainstorming here: maybe it is a good idea to split the parser into two 
parsers, i.e. parser-feed (link, title, content: the needed basics) and 
parser-feedextra (everything else and more). Good idea? Bad idea? I don't 
know, just wondering. My use case is that those who are not building a blog 
search could use only parser-feed to index the basic fields, thus saving 
parsing and indexing time. Those who are going for blog search would have both 
parsers enabled in nutch-site.xml. Just some thoughts.

I will try to send you some problem URLs directly via mail. 

Regards

> Possibly use a different library to parse RSS feed for improved performance 
> and compatibility
> ---------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-444
>                 URL: https://issues.apache.org/jira/browse/NUTCH-444
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 0.9.0
>            Reporter: Renaud Richardet
>            Priority: Minor
>             Fix For: 0.9.0
>
>         Attachments: parse-feed-v2.tar.bz2, parse-feed.tar.bz2
>
>
> As discussed by Nutch Newbie, Gal, and Chris on NUTCH-443, the current 
> library (feedparser) has the following issues:
> - OutOfMemory when parsing > 100k feeds, since it has to convert the feed to 
> jdom first
> - no support for Atom 1.0
> - there has been no development in the last year
> Alternatives are:
> - Rome 
> - Informa
> - custom implementation based on Stax
> - ??

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers
