Doug, et al.,

>Chris Schneider wrote:
>>Also, since we've been running this crawl for quite some time, we'd like to
>>preserve the segment data if at all possible. Could someone please recommend
>>a way to recover as gracefully as possible from this condition? The
>>Crawl.main process died with the following output:
>>
>>060129 221129 Indexer: adding segment:
>>/user/crawler/crawl-20060129091444/segments/20060129200246
>>Exception in thread "main" java.io.IOException: timed out waiting for response
>>        at org.apache.nutch.ipc.Client.call(Client.java:296)
>>        at org.apache.nutch.ipc.RPC$Invoker.invoke(RPC.java:127)
>>        at $Proxy1.submitJob(Unknown Source)
>>        at org.apache.nutch.mapred.JobClient.submitJob(JobClient.java:259)
>>        at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:288)
>>        at org.apache.nutch.indexer.Indexer.index(Indexer.java:263)
>>        at org.apache.nutch.crawl.Crawl.main(Crawl.java:127)
>>
>>However, it definitely seems as if the JobTracker is still waiting for the
>>job to finish (no failed jobs).
>
>Have you looked at the web UI? It will show if things are still running.
>This is on the jobtracker host at port 50030 by default.
Yes, this is how I know the JobTracker is still waiting for task_m_cia5po to
complete.

>The bug here is that the RPC call times out while the map task is computing
>splits. The fix is that the job tracker should not compute splits until after
>it has returned from the submitJob RPC. Please submit a bug in Jira to help
>remind us to fix this.

I'll be happy to log a bug for this.

Is there a work-around? Based on some other postings, I've increased
ipc.client.timeout to 300000 (5 minutes); the snippet below shows the exact
change. Does this property also control the timeout for the RPC call you
describe above? If so, should I increase it further? Is there a better way
for us to avoid getting caught by the RPC timeout you describe? This crawl
was only a medium-sized test; we hope to execute a much larger crawl over
the next few days.

>To recover, first determine if the indexing has completed. If it has not,
>then use the 'index' command to index things, followed by 'dedup' and
>'merge'. Look at the source for Crawl.java:
>
>http://svn.apache.org/viewcvs.cgi/lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Crawl.java?view=markup
>
>All you need to do to complete the crawl is to complete the last few steps
>manually.

We've done these steps manually before, so I'll get on that now; the commands
I plan to run are sketched below. I was mainly worried about whether to trust
these segments, how best to restart the processes, etc.
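For the list archives, here is the timeout change as it sits in our
conf/nutch-site.xml (a property block added inside the file's existing root
element; I'm assuming this property is picked up from nutch-site.xml like
any other override):

  <!-- Assumption: ipc.client.timeout is read from nutch-site.xml
       like other Nutch overrides. The value is in milliseconds. -->
  <property>
    <name>ipc.client.timeout</name>
    <value>300000</value>
    <description>Client RPC timeout in milliseconds, raised from the
    default to 5 minutes.</description>
  </property>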
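And here, roughly, is what I plan to run to finish the crawl by hand,
mirroring the last few steps of Crawl.java. This is an untested sketch: I'm
taking the argument order from the tools' usage messages, so please
double-check it against the output of 'bin/nutch index', 'bin/nutch dedup',
and 'bin/nutch merge' (run with no arguments) before copying. Also, shell
globbing won't expand NDFS paths, so every segment directory has to be
listed explicitly; only the one segment from the log output is shown here.

  # index: <indexes dir> <crawldb> <linkdb> <segment> ... (one arg per segment)
  bin/nutch index /user/crawler/crawl-20060129091444/indexes \
      /user/crawler/crawl-20060129091444/crawldb \
      /user/crawler/crawl-20060129091444/linkdb \
      /user/crawler/crawl-20060129091444/segments/20060129200246

  # dedup: <indexes dir>
  bin/nutch dedup /user/crawler/crawl-20060129091444/indexes

  # merge: <merged output index> <indexes dir>
  bin/nutch merge /user/crawler/crawl-20060129091444/index \
      /user/crawler/crawl-20060129091444/indexes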
Thanks,

- Chris

--
------------------------
Chris Schneider
TransPac Software, Inc.
[EMAIL PROTECTED]
------------------------