Dear Nutch Users,
We've been using Nutch 0.8 (MapReduce) to perform some internet
crawling. Things seemed to be going well on our 11 machines (1 master
with JobTracker/NameNode, 10 slaves with TaskTrackers/DataNodes)
until...
060129 222409 Lost tracker 'tracker_56288'
060129 222409 Task 'task_m_10gs5f' has been lost.
060129 222409 Task 'task_m_10qhzr' has been lost.
........
........
060129 222409 Task 'task_r_zggbwu' has been lost.
060129 222409 Task 'task_r_zh8dao' has been lost.
060129 222455 Server handler 8 on 8010 caught: java.net.SocketException: Socket closed
java.net.SocketException: Socket closed
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:99)
at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
at java.io.DataOutputStream.flush(DataOutputStream.java:106)
at org.apache.nutch.ipc.Server$Handler.run(Server.java:216)
060129 222455 Adding task 'task_m_cia5po' to set for tracker 'tracker_56288'
060129 223711 Adding task 'task_m_ffv59i' to set for tracker 'tracker_25647'
I'm hoping that someone could explain why task_m_cia5po got added to
tracker_56288 after this tracker was lost.
The TaskTracker now running on that slave is tracker_25647, and there's
nothing in its log referring to task_m_cia5po. Here's a snippet from
that timeframe:
060129 215343 task_r_a61kvl 0.9997992% reduce > reduce
060129 215349 task_r_a61kvl Recovered from failed datanode connection
060129 215349 task_r_a61kvl 1.0% reduce > reduce
060129 215349 Task task_r_a61kvl is done.
060129 215350 Server connection on port 50050 from 127.0.0.1: exiting
060129 221629 Lost connection to JobTracker [m1.krugle.net/192.168.1.11:8010]. ex=jav$
060129 222134 Lost connection to JobTracker [m1.krugle.net/192.168.1.11:8010]. ex=jav$
060129 222455 task_m_1cyb7c done; removing files.
........
........
060129 222535 task_m_zgpg9c done; removing files.
060129 222711 Stopping server on 50050
060129 222735 Server listener on port 50050: exiting
060129 222844 Server handler 0 on 50050: exiting
060129 222845 Server handler 6 on 50050: exiting
060129 222846 Server handler 1 on 50050: exiting
060129 222847 Server handler 3 on 50050: exiting
060129 222848 Server handler 2 on 50050: exiting
060129 222849 Server handler 4 on 50050: exiting
060129 222849 Server handler 5 on 50050: exiting
060129 222849 Server handler 7 on 50050: exiting
060129 223211 Stopping server on 50040
060129 223226 Client connection to 192.168.1.8:50040: closing
060129 223229 Client connection to 192.168.1.1:50040: closing
060129 223230 Client connection to 127.0.0.1:50040: closing
060129 223230 Server connection on port 50040 from 127.0.0.1: exiting
060129 223230 Client connection to 127.0.0.1:50040: closing
060129 223230 Server connection on port 50040 from 127.0.0.1: exiting
060129 223250 Server connection on port 50040 from 192.168.1.7: exiting
060129 223408 Server connection on port 50040 from 192.168.1.8: exiting
060129 223505 Server listener on port 50040: exiting
060129 223531 Server connection on port 50040 from 192.168.1.1: exiting
060129 223532 Server connection on port 50040 from 192.168.1.3: exiting
060129 223536 Server connection on port 50040 from 192.168.1.6: exiting
060129 223548 Server handler 0 on 50040: exiting
060129 223550 Server handler 6 on 50040: exiting
060129 223554 Server handler 3 on 50040: exiting
060129 223554 Server handler 2 on 50040: exiting
060129 223556 Server connection on port 50040 from 192.168.1.4: exiting
060129 223556 Server handler 7 on 50040: exiting
060129 223559 Server handler 1 on 50040: exiting
060129 223601 Server connection on port 50040 from 192.168.1.2: exiting
060129 223601 Server handler 5 on 50040: exiting
060129 223602 Server connection on port 50040 from 192.168.1.5: exiting
060129 223604 Server handler 4 on 50040: exiting
060129 223648 Client connection to 192.168.1.2:50040: closing
060129 223704 Server connection on port 50040 from 192.168.1.9: exiting
060129 223707 Client connection to 192.168.1.5:50040: closing
060129 223707 Client connection to 192.168.1.4:50040: closing
060129 223710 Client connection to 192.168.1.9:50040: closing
060129 223711 Reinitializing local state
060129 223711 Server listener on port 50050: starting
060129 223711 Server handler 0 on 50050: starting
060129 223711 Server handler 1 on 50050: starting
060129 223711 Server handler 2 on 50050: starting
060129 223711 Server handler 3 on 50050: starting
060129 223711 Server handler 4 on 50050: starting
060129 223711 Server handler 5 on 50050: starting
060129 223711 Server handler 6 on 50050: starting
060129 223711 Server handler 7 on 50050: starting
060129 223711 Server listener on port 50040: starting
060129 223711 Server handler 0 on 50040: starting
060129 223711 Server handler 1 on 50040: starting
060129 223711 Server handler 2 on 50040: starting
060129 223711 Server handler 3 on 50040: starting
060129 223711 Server handler 4 on 50040: starting
060129 223711 Server handler 5 on 50040: starting
060129 223711 Server handler 6 on 50040: starting
060129 223711 Server handler 7 on 50040: starting
060129 223711 parsing file:/home/crawler/nutch/conf/nutch-default.xml
So it does look like this TaskTracker relaunched itself and is now
communicating with the JobTracker.
Unfortunately, the JobTracker is still hung up waiting for
task_m_cia5po (the last of 4960 map tasks in the indexing job,
job_7nflgy) to complete, but since that task is assigned to a
TaskTracker that is no longer active, no progress is being made.
Also, since we've been running this crawl for quite some time, we'd
like to preserve the segment data if at all possible. Could someone
please recommend a way to recover as gracefully as possible from this
condition? The Crawl.main process died with the following output:
060129 221129 Indexer: adding segment: /user/crawler/crawl-20060129091444/segments/20060129200246
Exception in thread "main" java.io.IOException: timed out waiting for response
at org.apache.nutch.ipc.Client.call(Client.java:296)
at org.apache.nutch.ipc.RPC$Invoker.invoke(RPC.java:127)
at $Proxy1.submitJob(Unknown Source)
at org.apache.nutch.mapred.JobClient.submitJob(JobClient.java:259)
at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:288)
at org.apache.nutch.indexer.Indexer.index(Indexer.java:263)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:127)
However, it definitely seems as if the JobTracker is still waiting
for the job to finish (no failed jobs).
My thought would be to first try stopping the TaskTracker process on
that slave. If that didn't change the JobTracker's state, I guess I
would just do a stop-all.sh followed by a start-all.sh, then execute
the following command to toss the incompletely indexed segment (the
full sequence is sketched out below):
bin/nutch ndfs -rm crawl-20060129091444/segments/20060129200246
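For concreteness, here's the whole sequence written out as shell
commands. The nutch-daemon.sh invocation is an assumption on my part
(I'm going by the scripts in our 0.8 checkout), so please treat this
as a sketch of the plan rather than a tested procedure:

# 1. On the affected slave: try stopping just the stale TaskTracker
#    (assumes bin/nutch-daemon.sh is the right script for this in 0.8)
bin/nutch-daemon.sh stop tasktracker

# 2. If the JobTracker state still doesn't change, restart everything
#    from the master
bin/stop-all.sh
bin/start-all.sh

# 3. Remove the incompletely indexed segment, keeping the rest of the
#    segment data intact
bin/nutch ndfs -rm crawl-20060129091444/segments/20060129200246

Obviously I'd prefer it if step 1 alone got the JobTracker to mark
task_m_cia5po as failed and reschedule it elsewhere, so we could avoid
the full restart.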
Other Thoughts?
- Chris
--
------------------------
Chris Schneider
TransPac Software, Inc.
[EMAIL PROTECTED]
------------------------