[ https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12472099 ]
nutch.newbie commented on NUTCH-444:
------------------------------------

Otis: Thanks for the info. As for me, I am going with parse-feed. I would also like to give a StAX-based solution a try.

Dogacan: It's working rather well with parse-feed. However, I would be glad if you could do a quick check on my parse-plugins.xml modifications, because this setup also throws an error during dedup (when magic is false in nutch-site.xml). I want to know whether it's something I am doing wrong or some other bug. I am thinking of doing a test run later tonight with 10 000 feeds, so I would be glad if you could clarify the following case.

(The following case only happens when there is just 1 url)

- urls.txt file contains 1 url, which is http://blog.foofactory.fi/atom.xml
- bin/nutch crawl with depth 1 gives me the following error during dedup:

2007-02-11 13:32:26,846 WARN mapred.LocalJobRunner - job_k9e9c2
java.lang.ArrayIndexOutOfBoundsException: -1
    at org.apache.lucene.index.MultiReader.isDeleted(MultiReader.java:109)
    at org.apache.nutch.indexer.DeleteDuplicates$InputFormat$DDRecordReader.next(DeleteDuplicates.java:176)
    at org.apache.hadoop.mapred.MapTask$2.next(MapTask.java:166)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:183)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:109)

And the fetch/parse phase for the above blog gives me the following:

2007-02-11 13:32:09,673 DEBUG http.Http - fetched 208 bytes from http://blog.foofactory.fi/robots.txt
2007-02-11 13:32:09,674 DEBUG http.Http - fetching http://blog.foofactory.fi/atom.xml
2007-02-11 13:32:10,560 INFO mapred.JobClient - map 100% reduce 0%
2007-02-11 13:32:10,769 DEBUG http.Http - fetched 38151 bytes from http://blog.foofactory.fi/atom.xml
2007-02-11 13:32:10,965 DEBUG parse.ParseUtil - Parsing [http://blog.foofactory.fi/atom.xml] with [EMAIL PROTECTED]
2007-02-11 13:32:11,292 INFO mapred.LocalJobRunner - 0 pages, 0 errors, 0.0 pages/s, 0 kb/s,
2007-02-11 13:32:11,627 INFO crawl.SignatureFactory - Using Signature impl: org.apache.nutch.crawl.MD5Signature
2007-02-11 13:32:11,654 WARN fetcher.Fetcher - Error parsing: http://blog.foofactory.fi/atom.xml: failed(2,200): java.lang.NullPointerException
2007-02-11 13:32:12,293 INFO mapred.LocalJobRunner - 1 pages, 0 errors, 0.3 pages/s, 99 kb/s,
2007-02-11 13:32:12,306 DEBUG mapred.MapTask - opened spill0.out
2007-02-11 13:32:12,381 INFO mapred.LocalJobRunner - 1 pages, 0 errors, 0.3 pages/s, 99 kb/s,

Below are my parse-plugins.xml changes:

<mimeType name="application/rss+xml">
  <plugin id="parse-feed" />
</mimeType>

<mimeType name="text/xml">
  <plugin id="parse-feed" />
</mimeType>

<alias name="parse-feed" extension-id="org.apache.nutch.parse.feed.FeedParser" />

I have also mapped text/xml in parse-feed/plugin.xml, because most of the time I get text/xml rather than application/rss+xml as the content type.

Also, you mentioned you are using this to test. What does your test configuration look like? Can you re-create my problem?

Thanks again for the plugin and many thanks for your help. I look forward to contributing in terms of index-feed and query-feed.
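For anyone trying to reproduce this: the "magic is false" setting mentioned above presumably refers to the mime.type.magic property. The fragment below is only a sketch of how that override might look in conf/nutch-site.xml; the property name is taken from nutch-default.xml as I remember it, so please check it against your own copy before relying on it.

```xml
<!-- Sketch of a conf/nutch-site.xml override (assumption: the property is
     named mime.type.magic, as in nutch-default.xml). With the value false,
     content-type detection does not use magic-byte resolution. -->
<configuration>
  <property>
    <name>mime.type.magic</name>
    <value>false</value>
  </property>
</configuration>
```

With magic detection off, a feed reported as text/xml is only routed to parse-feed if text/xml is mapped in parse-plugins.xml, as in the changes shown above.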
> Possibly use a different library to parse RSS feed for improved performance and compatibility
> ---------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-444
>                 URL: https://issues.apache.org/jira/browse/NUTCH-444
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 0.9.0
>            Reporter: Renaud Richardet
>            Priority: Minor
>             Fix For: 0.9.0
>
>         Attachments: parse-feed.tar.bz2
>
>
> As discussed by Nutch Newbie, Gal, and Chris on NUTCH-443, the current library (feedparser) has the following issues:
> - OutOfMemory when parsing > 100k feeds, since it has to convert the feed to jdom first
> - no support for Atom 1.0
> - there has been no development in the last year
> Alternatives are:
> - Rome
> - Informa
> - custom implementation based on Stax
> - ??

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers