Dear Nutch Users,
We've been using Nutch 0.8 (MapReduce) to perform some internet
crawling. Things seemed to be going well on our 11 machines (1 master
with JobTracker/NameNode, 10 slaves with TaskTrackers/DataNodes)
until...
060129 222409 Lost tracker 'tracker_56288'
060129 222409 Task 'task_m_10gs5f' has been lost.
060129 222409 Task 'task_m_10qhzr' has been lost.
........
........
060129 222409 Task 'task_r_zggbwu' has been lost.
060129 222409 Task 'task_r_zh8dao' has been lost.
060129 222455 Server handler 8 on 8010 caught: java.net.SocketException: Socket closed
java.net.SocketException: Socket closed
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:99)
at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
at java.io.DataOutputStream.flush(DataOutputStream.java:106)
at org.apache.nutch.ipc.Server$Handler.run(Server.java:216)
060129 222455 Adding task 'task_m_cia5po' to set for tracker 'tracker_56288'
060129 223711 Adding task 'task_m_ffv59i' to set for tracker 'tracker_25647'
I'm hoping that someone could explain why task_m_cia5po got added to
tracker_56288 after this tracker was lost.
The TaskTracker now running on that slave is tracker_25647, and there's
nothing in its log referring to task_m_cia5po. Here's a snippet from
that timeframe:
060129 215343 task_r_a61kvl 0.9997992% reduce > reduce
060129 215349 task_r_a61kvl Recovered from failed datanode connection
060129 215349 task_r_a61kvl 1.0% reduce > reduce
060129 215349 Task task_r_a61kvl is done.
060129 215350 Server connection on port 50050 from 127.0.0.1: exiting
060129 221629 Lost connection to JobTracker [m1.krugle.net/192.168.1.11:8010]. ex=jav$
060129 222134 Lost connection to JobTracker [m1.krugle.net/192.168.1.11:8010]. ex=jav$
060129 222455 task_m_1cyb7c done; removing files.
........
........
060129 222535 task_m_zgpg9c done; removing files.
060129 222711 Stopping server on 50050
060129 222735 Server listener on port 50050: exiting
060129 222844 Server handler 0 on 50050: exiting
060129 222845 Server handler 6 on 50050: exiting
060129 222846 Server handler 1 on 50050: exiting
060129 222847 Server handler 3 on 50050: exiting
060129 222848 Server handler 2 on 50050: exiting
060129 222849 Server handler 4 on 50050: exiting
060129 222849 Server handler 5 on 50050: exiting
060129 222849 Server handler 7 on 50050: exiting
060129 223211 Stopping server on 50040
060129 223226 Client connection to 192.168.1.8:50040: closing
060129 223229 Client connection to 192.168.1.1:50040: closing
060129 223230 Client connection to 127.0.0.1:50040: closing
060129 223230 Server connection on port 50040 from 127.0.0.1: exiting
060129 223230 Client connection to 127.0.0.1:50040: closing
060129 223230 Server connection on port 50040 from 127.0.0.1: exiting
060129 223250 Server connection on port 50040 from 192.168.1.7: exiting
060129 223408 Server connection on port 50040 from 192.168.1.8: exiting
060129 223505 Server listener on port 50040: exiting
060129 223531 Server connection on port 50040 from 192.168.1.1: exiting
060129 223532 Server connection on port 50040 from 192.168.1.3: exiting
060129 223536 Server connection on port 50040 from 192.168.1.6: exiting
060129 223548 Server handler 0 on 50040: exiting
060129 223550 Server handler 6 on 50040: exiting
060129 223554 Server handler 3 on 50040: exiting
060129 223554 Server handler 2 on 50040: exiting
060129 223556 Server connection on port 50040 from 192.168.1.4: exiting
060129 223556 Server handler 7 on 50040: exiting
060129 223559 Server handler 1 on 50040: exiting
060129 223601 Server connection on port 50040 from 192.168.1.2: exiting
060129 223601 Server handler 5 on 50040: exiting
060129 223602 Server connection on port 50040 from 192.168.1.5: exiting
060129 223604 Server handler 4 on 50040: exiting
060129 223648 Client connection to 192.168.1.2:50040: closing
060129 223704 Server connection on port 50040 from 192.168.1.9: exiting
060129 223707 Client connection to 192.168.1.5:50040: closing
060129 223707 Client connection to 192.168.1.4:50040: closing
060129 223710 Client connection to 192.168.1.9:50040: closing
060129 223711 Reinitializing local state
060129 223711 Server listener on port 50050: starting
060129 223711 Server handler 0 on 50050: starting
060129 223711 Server handler 1 on 50050: starting
060129 223711 Server handler 2 on 50050: starting
060129 223711 Server handler 3 on 50050: starting
060129 223711 Server handler 4 on 50050: starting
060129 223711 Server handler 5 on 50050: starting
060129 223711 Server handler 6 on 50050: starting
060129 223711 Server handler 7 on 50050: starting
060129 223711 Server listener on port 50040: starting
060129 223711 Server handler 0 on 50040: starting
060129 223711 Server handler 1 on 50040: starting
060129 223711 Server handler 2 on 50040: starting
060129 223711 Server handler 3 on 50040: starting
060129 223711 Server handler 4 on 50040: starting
060129 223711 Server handler 5 on 50040: starting
060129 223711 Server handler 6 on 50040: starting
060129 223711 Server handler 7 on 50040: starting
060129 223711 parsing file:/home/crawler/nutch/conf/nutch-default.xml
So it does look like this TaskTracker relaunched itself and is now
communicating with the JobTracker.
Unfortunately, the JobTracker is still hung up waiting for
task_m_cia5po (the last of 4960 map tasks in the indexing job,
job_7nflgy) to complete, but since that task is assigned to a
TaskTracker that is no longer active, no progress is being made.
Also, since we've been running this crawl for quite some time, we'd
like to preserve the segment data if at all possible. Could someone
please recommend a way to recover as gracefully as possible from this
condition? The Crawl.main process died with the following output:
060129 221129 Indexer: adding segment: /user/crawler/crawl-20060129091444/segments/20060129200246
Exception in thread "main" java.io.IOException: timed out waiting for response
at org.apache.nutch.ipc.Client.call(Client.java:296)
at org.apache.nutch.ipc.RPC$Invoker.invoke(RPC.java:127)
at $Proxy1.submitJob(Unknown Source)
at org.apache.nutch.mapred.JobClient.submitJob(JobClient.java:259)
at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:288)
at org.apache.nutch.indexer.Indexer.index(Indexer.java:263)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:127)
However, it definitely seems as if the JobTracker is still waiting
for the job to finish (no failed jobs).
My thought would be to first try stopping the TaskTracker process on
that slave. If that didn't change the JobTracker's state, I guess I
would just do a stop-all.sh followed by a start-all.sh, then execute
the following command to toss the incompletely indexed segment (the
full sequence is sketched out below):
bin/nutch ndfs -rm crawl-20060129091444/segments/20060129200246
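For concreteness, here's the whole sequence written out as shell
commands. The nutch-daemon.sh invocation is an assumption on my part
(I'm going by the scripts in our 0.8 checkout), so please treat this
as a sketch of the plan rather than a tested procedure:

# 1. On the affected slave: try stopping just the stale TaskTracker
#    (assumes bin/nutch-daemon.sh is the right script for this in 0.8)
bin/nutch-daemon.sh stop tasktracker

# 2. If the JobTracker state still doesn't change, restart everything
#    from the master
bin/stop-all.sh
bin/start-all.sh

# 3. Remove the incompletely indexed segment, keeping the rest of the
#    segment data intact
bin/nutch ndfs -rm crawl-20060129091444/segments/20060129200246

Obviously I'd prefer it if step 1 alone got the JobTracker to mark
task_m_cia5po as failed and reschedule it elsewhere, so we could avoid
the full restart.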
Other Thoughts?
- Chris
--
------------------------
Chris Schneider
TransPac Software, Inc.
[EMAIL PROTECTED]
------------------------