Chris Schneider wrote:
Also, since we've been running this crawl for quite some time, we'd like to preserve the segment data if at all possible. Could someone please recommend a way to recover as gracefully as possible from this condition? The Crawl.main process died with the following output:

060129 221129 Indexer: adding segment: /user/crawler/crawl-20060129091444/segments/20060129200246
Exception in thread "main" java.io.IOException: timed out waiting for response
    at org.apache.nutch.ipc.Client.call(Client.java:296)
    at org.apache.nutch.ipc.RPC$Invoker.invoke(RPC.java:127)
    at $Proxy1.submitJob(Unknown Source)
    at org.apache.nutch.mapred.JobClient.submitJob(JobClient.java:259)
    at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:288)
    at org.apache.nutch.indexer.Indexer.index(Indexer.java:263)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:127)

However, it definitely seems as if the JobTracker is still waiting for the job to finish (no failed jobs).

Have you looked at the web UI? It will show whether things are still running. It is on the jobtracker host at port 50030 by default.

The bug here is that the RPC call times out while the job tracker is computing the input splits for the job. The fix is that the job tracker should not compute splits until after it has returned from the submitJob RPC. Please file a bug in Jira to help remind us to fix this.
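
To make that concrete, the shape of the fix is roughly the following (a hypothetical sketch with illustrative names, not the actual org.apache.nutch.mapred.JobTracker source):

    // Hypothetical sketch: defer split computation off the RPC path.
    // Names here are illustrative and do not match the real JobTracker.
    public JobStatus submitJob(String jobFile) throws IOException {
      JobInProgress job = new JobInProgress(jobFile);
      pendingInit.add(job);   // a background thread computes the splits later,
                              // so submitJob returns before the client's
                              // RPC timeout can expire
      return job.getStatus();
    }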

To recover, first determine whether the indexing has completed. If it has not, use the 'index' command to index the segments, followed by 'dedup' and 'merge'. Look at the source of Crawl.java:

http://svn.apache.org/viewcvs.cgi/lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Crawl.java?view=markup

All you need to do to finish the crawl is run those last few steps manually.
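
For reference, the tail of Crawl.main() does something along these lines (a from-memory sketch; the argument lists are illustrative, and conf and fs stand for the usual configuration and filesystem objects, so check the linked Crawl.java for the real signatures in your version):

    // Rough sketch of the last steps of Crawl.main(); argument lists are
    // illustrative -- consult the linked Crawl.java for the real ones.
    File dir        = new File("crawl-20060129091444");   // the crawl directory
    File crawlDb    = new File(dir, "crawldb");
    File linkDb     = new File(dir, "linkdb");
    File[] segments = new File(dir, "segments").listFiles();
    File indexes    = new File(dir, "indexes");
    File index      = new File(dir, "index");

    new Indexer(conf).index(indexes, crawlDb, linkDb, segments);      // the 'index' step
    new DeleteDuplicates(conf).dedup(new File[] { indexes });         // the 'dedup' step
    new IndexMerger(fs, fs.listFiles(indexes), index, conf).merge();  // the 'merge' step

Each of these has a corresponding command-line entry point ('index', 'dedup', 'merge' in bin/nutch), so you can run the remaining steps individually without writing any code.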

Cheers,

Doug
