Re: [Nutch-general] Error after SVN update

Andrzej Bialecki Tue, 09 Jan 2007 05:29:50 -0800

Nutch Newbie wrote:
> Hi:
>
> Could someone please be kind enough to confirm whether the 0.9-dev trunk is
> broken? I did a total of 4 fresh installs and every time I get
> stuck in the indexing/reduce process. (Yes, Speculative = false.)
>
> It would feel much better if I am not the only one with this problem!
>
> Thank you for your help.
>
>
> On 1/8/07, Nutch Newbie <[EMAIL PROTECTED]> wrote:
>> Hi:
>>
>> I am getting the following error after updating to revision 494024. My
>> Hadoop-site.xml (mapred.speculative) set to false .. I am not sure
>> what I am doing wrong.. everything worked before the update.. Any
>> help..
>>
>> Regards
>>
>> Language identifier configuration [1-4/2048]
>>  map 100% reduce 0%
>> Language identifier plugin supports: it(1000) is(1000) hu(1000)
>> th(1000) sv(1000) fr(1000) ru(1000) fi(1000) es(1000) en(1000)
>> el(1000) ee(1000) pt(1000) de(1000) da(1000) pl(1000) no(1000)
>> nl(1000)
>> Adding org.apache.nutch.analysis.lang.LanguageIndexingFilter
>> running sort pass
>> flushing segment 0
>> reduce > sort
>> found resource common-terms.utf8 at
>> file:/usr/local/nutch-0.9-dev/conf/common-terms.utf8
>> Optimizing index.
>> Optimizing index.
>> job_qmhsvz
>> java.lang.RuntimeException: Unexpected status: 67
>>         at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:198)
>>         at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:307)
>>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:137)
>> Exception in thread "main" java.io.IOException: Job failed!
>>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:399)
>>         at org.apache.nutch.indexer.Indexer.index(Indexer.java:297)
>>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:134)


I can confirm that this is indeed a bug. I'll provide a patch soon. In 
the meantime you can just remove the "throw" statement in Indexer.reduce() 
(Indexer.java:198 in your trace); datums with other statuses will then 
simply be ignored.
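
A minimal, self-contained sketch of what that workaround amounts to. This is 
not the actual Indexer code: the class, the method and the STATUS_FETCHED 
value are hypothetical, and only the value 67 (named in this thread as 
CrawlDatum.STATUS_LINKED) and the exception message come from the trace above.

```java
import java.util.ArrayList;
import java.util.List;

public class StatusFilterSketch {
    // 67 is the value from the stack trace, identified in this thread as
    // CrawlDatum.STATUS_LINKED.
    static final int STATUS_LINKED = 67;
    // Hypothetical placeholder value, for illustration only.
    static final int STATUS_FETCHED = 1;

    /** Keeps the statuses this sketch treats as indexable and skips the rest. */
    static List<Integer> keepIndexable(int[] statuses) {
        List<Integer> kept = new ArrayList<Integer>();
        for (int status : statuses) {
            if (status == STATUS_FETCHED) {
                kept.add(status);
            }
            // Assumed original behaviour, inferred from the trace:
            //   throw new RuntimeException("Unexpected status: " + status);
            // With the throw removed, any other status (including
            // STATUS_LINKED) falls through and is ignored.
        }
        return kept;
    }

    public static void main(String[] args) {
        int[] statuses = {STATUS_FETCHED, STATUS_LINKED, STATUS_FETCHED};
        System.out.println(keepIndexable(statuses).size());
    }
}
```

The point is only that an unrecognized status no longer aborts the whole 
reduce; whether such a datum should be skipped at all is a separate question.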

The underlying issue is quite interesting - the status code that it's 
complaining about is CrawlDatum.STATUS_LINKED, which indicates a page 
that was redirected. However, as you can see there are probably some 
inlinks pointing to this page. Now, the question is - should we discard 
this page (and index only the target)? The answer is not simple.

BTW, if you guys are brave enough to use the bleeding edge from SVN, 
then you are expected to discuss any issues that arise from its use 
on nutch-dev. This mailing list is for users of regular releases or 
stable versions.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
