[ 
https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-444:
------------------------------------

    Attachment: NUTCH-444.Mattmann.061707.patch.txt

Hi Folks,

 Here is a patch that brings this issue up to date. It takes Doğacan's initial 
patch and cleans it up in several places, e.g.:

* a failed parse now returns ParseStatus.STATUS_FAILURE (was 
ParseStatus.STATUS_SUCCESS) - line 271
* reformatted code to conform to project style
* removed magic strings
* added the Apache license
* added a unit test
* fixed build.xml to include refs to the nutch-extensionpoints dependency 
during unit tests

 While I think there are a few minor open questions moving forward, I don't see 
any of them blocking the commit of this patch. In answer to the question I 
raised above about this issue: overall, the feed plugin provided here offers a 
superset of the functionality of parse-rss. So, I am +1 for removing parse-rss. 
Some things to consider going forward:

1. I did find one difference in semantics between the parse-rss plugin and the 
feed plugin: the feed plugin uses the URL of the channel file itself as the 
Text key in the <Text, Parse> map provided by the ParseResult class. While 
this is probably the correct thing to do, it caused me some grief initially 
because it made my unit test fail. My unit test expected to receive the URL 
http://test.channel.com, the channel URL declared in the rsstest.rss file that 
serves as sample input for the unit test. However, since the feed plugin parser 
takes the *actual* URL of the channel file (e.g., 
file:/some/path/on/your/system/rsstest.rss) rather than the declared channel 
URL, the test failed. The old parse-rss plugin took the declared channel URL 
instead. I thought about this, and it's not a major hurdle. I think the 
semantics of simply taking the URL of the channel file that was actually used 
(even if it is a file: URL) are fine.

2. It might be a good idea to factor out the index/parse properties taken from 
the feed and allow them to be specified in a configuration file for this 
plugin. In other words, wouldn't it be nice to tell the plugin which fields we 
want to extract (e.g., author, published date, etc.)? This would be an 
improvement to this plugin later on.
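
To make point 2 a bit more concrete, here is a rough sketch of what a 
configurable field list could look like. It is not part of the patch: the 
property name feed.index.fields, the default field names, and the class name 
are all made up for illustration, and it uses plain java.util.Properties 
rather than the Nutch configuration classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Illustrative sketch only: this class is hypothetical and not part of the patch.
class FeedFieldConfigSketch {

    // Hypothetical property name; no such setting exists in Nutch today.
    static final String FIELDS_KEY = "feed.index.fields";

    // Hypothetical default, used when the property is not set.
    static final String DEFAULT_FIELDS = "author,published";

    /** Returns the configured field names, trimmed, with empty entries skipped. */
    static List<String> fieldsToExtract(Properties conf) {
        String raw = conf.getProperty(FIELDS_KEY, DEFAULT_FIELDS);
        List<String> fields = new ArrayList<>();
        for (String name : raw.split(",")) {
            String trimmed = name.trim();
            if (!trimmed.isEmpty()) {
                fields.add(trimmed);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(FIELDS_KEY, "author, published, tags");
        // The parser would then copy only these fields into the parse metadata.
        System.out.println(fieldsToExtract(conf)); // prints [author, published, tags]
    }
}
```

The plugin would consult such a list when deciding which feed entry 
properties to copy over, so changing the indexed fields would not require a 
code change.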

Okey dok, so here it is. If there are no objections, I'd like to commit this in 
the next 48 hrs. I'd also like feedback from folks like Andrzej and Doğacan 
regarding removing parse-rss from the sources.

Thanks!

Cheers,
  Chris



> Possibly use a different library to parse RSS feed for improved performance 
> and compatibility
> ---------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-444
>                 URL: https://issues.apache.org/jira/browse/NUTCH-444
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 0.9.0
>            Reporter: Renaud Richardet
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>             Fix For: 1.0.0
>
>         Attachments: feed.tar.bz2, NUTCH-444.Mattmann.061707.patch.txt, 
> NUTCH-444.patch, parse-feed-v2.tar.bz2, parse-feed.tar.bz2
>
>
> As discussed by Nutch Newbie, Gal, and Chris on NUTCH-443, the current 
> library (feedparser) has the following issues:
> - OutOfMemory when parsing > 100k feeds, since it has to convert the feed to 
> jdom first
> - no support for Atom 1.0
> - there has been no development in the last year
> Alternatives are:
> - Rome 
> - Informa
> - custom implementation based on Stax
> - ??

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers
