[ https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chris A. Mattmann updated NUTCH-444:
------------------------------------

    Attachment: NUTCH-444.Mattmann.061707.patch.txt

Hi Folks,

Here is a patch that brings this issue up-to-date. The patch takes Doğacan's initial patch and cleans it up in many places, e.g.:

* changed to ParseStatus.STATUS_FAILURE on failed parse (was ParseStatus.STATUS_SUCCESS) - line 271
* reformatted code to conform to project style
* removed magic strings
* added in Apache license
* added in unit test
* fixed build.xml file to include refs to nutch-extensionpoints dep during unit test

While I think there are a few minor open questions moving forward, I don't see any of them hindering the commit of this patch. In answer to my earlier question on this issue, I noticed that, all in all, the feed plugin provided here offers a superset of the functionality of parse-rss. So, I am +1 for removing parse-rss.

Some things to consider going forward:

1. I did find one difference in semantics between the parse-rss plugin and the feed plugin: the feed plugin adds the URL pointer to the channel file as the Text entry in the <Text, Parse> map provided in the ParseResult class. While this is probably the correct thing to do, it caused me some grief initially because it made my unit test fail. My unit test was expecting to receive the URL http://test.channel.com, the channel URL identified in the rsstest.rss file that is provided as sample input for the unit test. However, since the feed plugin parser takes the *actual* URL pointer to the channel file (e.g., file:/some/path/on/your/system/rsstest.rss) rather than the specified channel URL, this test was failing. The old parse-rss plugin took the channel URL instead. I thought about this, and it's not a major hurdle. I think the semantics of simply taking the URL pointer to the channel file that was used (even if it was a file: pointer) are fine.

2. It might be a good idea to factor out the desired index/parse properties taken from the feed and allow them to be specified in a configuration file for this plugin. In other words, wouldn't it be nice to tell the plugin which fields we want to extract (e.g., author, published date, etc.)? This would be an improvement to this plugin later on.

Okey dok, so here it is. If there are no objections, I'd like to commit this in the next 48 hrs. I'd also like feedback from folks like Andrzej and Doğacan regarding removing parse-rss from the sources.

Thanks!

Cheers,
Chris

> Possibly use a different library to parse RSS feed for improved performance and compatibility
> ---------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-444
>                 URL: https://issues.apache.org/jira/browse/NUTCH-444
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 0.9.0
>            Reporter: Renaud Richardet
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>             Fix For: 1.0.0
>
>         Attachments: feed.tar.bz2, NUTCH-444.Mattmann.061707.patch.txt, NUTCH-444.patch, parse-feed-v2.tar.bz2, parse-feed.tar.bz2
>
>
> As discussed by Nutch Newbie, Gal, and Chris on NUTCH-443, the current library (feedparser) has the following issues:
> - OutOfMemory when parsing > 100k feeds, since it has to convert the feed to jdom first
> - no support for Atom 1.0
> - there has been no development in the last year
> Alternatives are:
> - Rome
> - Informa
> - custom implementation based on Stax
> - ??
_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers
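
As a rough illustration of the suggestion in the comment above that the plugin could be told which feed fields to extract, here is a minimal sketch in plain Java. It is not taken from the attached patch and does not use any Nutch or feed-library API: the class FeedFieldFilter, its methods, and the property name parse.feed.fields are all hypothetical, and a plain Map stands in for whatever per-entry metadata the plugin actually has available.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper for illustration only; not part of Nutch or the patch. */
final class FeedFieldFilter {

  /** Hypothetical property name; the patch defines no such setting. */
  static final String FIELDS_PROPERTY = "parse.feed.fields";

  private FeedFieldFilter() {}

  /** Splits a comma-separated value into an ordered set of non-empty field names. */
  static Set<String> parseFieldList(String value) {
    Set<String> fields = new LinkedHashSet<>();
    if (value == null) {
      return fields;
    }
    for (String name : value.split(",")) {
      String trimmed = name.trim();
      if (!trimmed.isEmpty()) {
        fields.add(trimmed);
      }
    }
    return fields;
  }

  /** Keeps only the wanted fields that the entry actually carries, in configured order. */
  static Map<String, String> selectFields(Map<String, String> entry, Set<String> wanted) {
    Map<String, String> selected = new LinkedHashMap<>();
    for (String field : wanted) {
      String fieldValue = entry.get(field);
      if (fieldValue != null) {
        selected.put(field, fieldValue);
      }
    }
    return selected;
  }

  public static void main(String[] args) {
    // Assumption: a real plugin would read this from its configuration;
    // a JVM system property stands in for that here.
    String configured = System.getProperty(FIELDS_PROPERTY, "author,published");

    // Made-up sample entry metadata, for demonstration only.
    Map<String, String> entry = new LinkedHashMap<>();
    entry.put("author", "Jane Doe");
    entry.put("title", "Sample entry");
    entry.put("published", "2007-06-17");

    System.out.println(selectFields(entry, parseFieldList(configured)));
  }
}
```

Running it with -Dparse.feed.fields=title,author would select a different set of fields without any code change, which is the kind of flexibility the suggestion is after. Fields named in the setting but absent from an entry are simply skipped.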