Chris Schneider wrote:
Also, since we've been running this crawl for quite some time, we'd like
to preserve the segment data if at all possible. Could someone please
recommend a way to recover as gracefully as possible from this
condition? The Crawl.main process died with the following output:
060129 221129 Indexer: adding segment: /user/crawler/crawl-20060129091444/segments/20060129200246
Exception in thread "main" java.io.IOException: timed out waiting for response
        at org.apache.nutch.ipc.Client.call(Client.java:296)
        at org.apache.nutch.ipc.RPC$Invoker.invoke(RPC.java:127)
        at $Proxy1.submitJob(Unknown Source)
        at org.apache.nutch.mapred.JobClient.submitJob(JobClient.java:259)
        at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:288)
        at org.apache.nutch.indexer.Indexer.index(Indexer.java:263)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:127)
However, it definitely seems as if the JobTracker is still waiting for
the job to finish (no failed jobs).
Have you looked at the web UI? It will show whether things are still
running. By default it is on the jobtracker host at port 50030.
The bug here is that the submitJob RPC call times out while the job
tracker is computing splits. The fix is that the job tracker should not
compute splits until after it has returned from the submitJob RPC.
Please file a bug in Jira to remind us to fix this.
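
To illustrate the idea behind that fix, here is a minimal sketch, not
Nutch code. Every name in it (DeferredSplitsSketch, JobHandle,
computeSplits) is hypothetical, and the sleep merely stands in for slow
split computation. The point is only that submitJob queues the work and
returns at once, so the caller is not left waiting on a slow call:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a job tracker; none of these names come from Nutch.
class DeferredSplitsSketch {
    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    private final AtomicInteger nextId = new AtomicInteger();

    /** Handle given back to the caller as soon as the job is queued. */
    static final class JobHandle {
        final int id;
        final Future<Integer> splitCount; // completes once splits are computed

        JobHandle(int id, Future<Integer> splitCount) {
            this.id = id;
            this.splitCount = splitCount;
        }
    }

    /** RPC-style entry point: it only queues the job, so it returns quickly. */
    JobHandle submitJob(final int inputRecords, final int recordsPerSplit) {
        int id = nextId.incrementAndGet();
        Callable<Integer> task = () -> computeSplits(inputRecords, recordsPerSplit);
        return new JobHandle(id, worker.submit(task));
    }

    /** Stand-in for the slow part; the sleep simulates scanning a large input. */
    static int computeSplits(int inputRecords, int recordsPerSplit)
            throws InterruptedException {
        Thread.sleep(200);
        // Ceiling division: number of splits needed to cover all records.
        return (inputRecords + recordsPerSplit - 1) / recordsPerSplit;
    }

    void shutdown() {
        worker.shutdown();
    }

    public static void main(String[] args) throws Exception {
        DeferredSplitsSketch tracker = new DeferredSplitsSketch();
        long start = System.nanoTime();
        JobHandle job = tracker.submitJob(1000, 64);
        long submitMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        System.out.println("submitJob returned after " + submitMillis + " ms");
        // Blocks here, after the "RPC" has already returned.
        System.out.println("splits: " + job.splitCount.get());
        tracker.shutdown();
    }
}
```

With this shape the time spent computing splits no longer counts
against the RPC timeout; the client would poll for job status instead.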
To recover, first determine if the indexing has completed. If it has
not, then use the 'index' command to index things, followed by 'dedup'
and 'merge'. Look at the source for Crawl.java:
http://svn.apache.org/viewcvs.cgi/lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Crawl.java?view=markup
All you need to do to complete the crawl is to complete the last few
steps manually.
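
As a rough sketch of those last steps, the helper below just prints
the commands that would remain. It makes several assumptions that you
should check against Crawl.java and against the usage message of each
command: that the crawl directory holds crawldb, linkdb, indexes and
index subdirectories, that the arguments go in the order shown, and
that 'dedup' and 'merge' are still needed even when indexing did
finish. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: lists the commands left to run by hand.
class RemainingCrawlSteps {

    /**
     * Builds the remaining command lines. The directory names and the
     * argument order are assumptions, not taken from the original message.
     */
    static List<String> remainingCommands(String crawlDir,
                                          List<String> segments,
                                          boolean indexingCompleted) {
        String indexes = crawlDir + "/indexes";
        List<String> commands = new ArrayList<>();
        if (!indexingCompleted) {
            // Index every segment into the per-segment indexes directory.
            commands.add("bin/nutch index " + indexes + " "
                    + crawlDir + "/crawldb " + crawlDir + "/linkdb "
                    + String.join(" ", segments));
        }
        // Remove duplicate documents, then merge into a single index.
        commands.add("bin/nutch dedup " + indexes);
        commands.add("bin/nutch merge " + crawlDir + "/index " + indexes);
        return commands;
    }

    public static void main(String[] args) {
        String crawlDir = "/user/crawler/crawl-20060129091444";
        List<String> segments = Arrays.asList(crawlDir + "/segments/20060129200246");
        for (String command : remainingCommands(crawlDir, segments, false)) {
            System.out.println(command);
        }
    }
}
```

Only the one segment from the log above is listed here; a real run
would pass every segment under the crawl's segments directory.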
Cheers,
Doug