Nutch Newbie wrote:
> Hi:
>
> Could someone please be kind enough to confirm whether the 0.9-dev trunk is
> broken? I did a total of 4 fresh installs and every time I get
> stuck in the indexing/reduce process. (Yes, Speculative = false.)
>
> It would feel much better if I am not the only one with this problem!
>
> Thank you for your help.
>
>
> On 1/8/07, Nutch Newbie <[EMAIL PROTECTED]> wrote:
>> Hi:
>>
>> I am getting the following error after updating to revision 494024. My
>> Hadoop-site.xml (mapred.speculative) is set to false. I am not sure
>> what I am doing wrong; everything worked before the update. Any
>> help?
>>
>> Regards
>>
>> Language identifier configuration [1-4/2048]
>> map 100% reduce 0%
>> Language identifier plugin supports: it(1000) is(1000) hu(1000)
>> th(1000) sv(1000) fr(1000) ru(1000) fi(1000) es(1000) en(1000)
>> el(1000) ee(1000) pt(1000) de(1000) da(1000) pl(1000) no(1000)
>> nl(1000)
>> Adding org.apache.nutch.analysis.lang.LanguageIndexingFilter
>> running sort pass
>> flushing segment 0
>> reduce > sort
>> found resource common-terms.utf8 at
>> file:/usr/local/nutch-0.9-dev/conf/common-terms.utf8
>> Optimizing index.
>> Optimizing index.
>> job_qmhsvz
>> java.lang.RuntimeException: Unexpected status: 67
>>         at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:198)
>>         at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:307)
>>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:137)
>> Exception in thread "main" java.io.IOException: Job failed!
>>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:399)
>>         at org.apache.nutch.indexer.Indexer.index(Indexer.java:297)
>>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:134)

I can confirm that it is indeed a bug. I'll provide a patch soon. In the
meantime you can just remove the "throw" statement that raises
"Unexpected status" in Indexer.reduce(); datums with other statuses will
then simply be ignored.

The underlying issue is quite interesting. The status code that it's
complaining about (67) is CrawlDatum.STATUS_LINKED, which indicates a page
that was redirected. However, as you can see, there are probably some
inlinks pointing to this page. Now, the question is: should we discard
this page (and index only the target)? The answer is not simple.

BTW, if you guys are brave enough to use the bleeding edge from SVN, then
you are expected to discuss any issues that may arise from its use on
nutch-dev. This mailing list is for users of regular releases, or stable
versions ...

-- 
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
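
For readers following along, the interim workaround described in the reply (skip a datum with an unexpected status instead of throwing) can be sketched roughly as below. This is a standalone illustration, not the Nutch source: the class name, method names, and the STATUS_HANDLED placeholder are hypothetical, and only the value 67 for STATUS_LINKED is taken from the stack trace and the reply.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; it does not use or reproduce org.apache.nutch.indexer.Indexer.
final class ReduceStatusSketch {
    // 67 is the value in "Unexpected status: 67"; the reply names it CrawlDatum.STATUS_LINKED.
    static final int STATUS_LINKED = 67;
    // Placeholder for "a status the reducer knows how to handle"; not a real Nutch constant.
    static final int STATUS_HANDLED = 1;

    // Strict variant: any status other than the handled one aborts the job.
    static List<Integer> strict(int[] statuses) {
        List<Integer> kept = new ArrayList<>();
        for (int s : statuses) {
            if (s == STATUS_HANDLED) {
                kept.add(s);
            } else {
                throw new RuntimeException("Unexpected status: " + s);
            }
        }
        return kept;
    }

    // Lenient variant: the throw is removed, so other statuses are silently skipped.
    static List<Integer> lenient(int[] statuses) {
        List<Integer> kept = new ArrayList<>();
        for (int s : statuses) {
            if (s == STATUS_HANDLED) {
                kept.add(s);
            }
            // Any other status (including STATUS_LINKED) is ignored here.
        }
        return kept;
    }

    public static void main(String[] args) {
        int[] statuses = {STATUS_HANDLED, STATUS_LINKED, STATUS_HANDLED};
        System.out.println("kept by lenient variant: " + lenient(statuses).size());
    }
}
```

Skipping the datum only avoids the job failure; it does not settle the open question from the reply of whether a redirected page with inlinks should be indexed or discarded.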
