Hi,
I've been using Hadoop for three months on a two-machine test cluster
and never had a single problem with it. However, since upgrading from version
0.20.0 to version 0.20.1, the tasktracker goes down frequently. HDFS
still works: I can read and write files, and even run HBase on top of it. The
problem is that I can't schedule jobs anymore. The logs are below. Does anybody
know what might be happening? The tasktracker dies and doesn't come
back until I restart it.
Thanks in advance,
Lucas
Log file: hadoop-root-tasktracker-server2.log
2009-10-17 18:06:39,044 INFO org.apache.hadoop.mapred.IndexCache: Map ID
attempt_200910151507_2024_m_000001_0 not found in cache
2009-10-17 18:06:39,044 INFO org.apache.hadoop.mapred.TaskRunner:
attempt_200910151507_2024_m_000000_0 done; removing files.
2009-10-17 18:07:12,103 ERROR org.apache.hadoop.mapred.TaskTracker: Caught
exception: java.io.IOException: Call to server2/192.168.1.3:9001
failed on local exception: java.io.EOFException
at org.apache.hadoop.ipc.Client.wrapException(Client.java:774)
at org.apache.hadoop.ipc.Client.call(Client.java:742)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
at org.apache.hadoop.mapred.$Proxy4.heartbeat(Unknown Source)
at
org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:1215)
at
org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:1037)
at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1720)
at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2833)
Caused by: java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:375)
at
org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501)
at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446)
2009-10-17 18:07:12,104 INFO org.apache.hadoop.mapred.TaskTracker: Resending
'status' to 'server2' with reponseId '-4573
2009-10-17 18:07:13,105 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: server2/192.168.1.3:9001. Already tried 0 time(s).
2009-10-17 18:07:14,105 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: server2/192.168.1.3:9001. Already tried 1 time(s).
2009-10-17 18:07:15,106 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: server2/192.168.1.3:9001. Already tried 2 time(s).
2009-10-17 18:07:16,106 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: server2/192.168.1.3:9001. Already tried 3 time(s).
2009-10-17 18:07:17,106 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: server2/192.168.1.3:9001. Already tried 4 time(s).
2009-10-17 18:07:18,107 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: server2/192.168.1.3:9001. Already tried 5 time(s).
2009-10-17 18:07:19,107 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: server2/192.168.1.3:9001. Already tried 6 time(s).
2009-10-17 18:07:20,108 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: server2/192.168.1.3:9001. Already tried 7 time(s).
2009-10-17 18:07:21,108 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: server2/192.168.1.3:9001. Already tried 8 time(s).
2009-10-17 18:07:22,109 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: server2/192.168.1.3:9001. Already tried 9 time(s).
2009-10-17 18:07:22,111 ERROR org.apache.hadoop.mapred.TaskTracker: Caught
exception: java.net.ConnectException: Call to
server2/192.168.1.3:9001 failed on connection exception:
java.net.ConnectException: Connection refused
at org.apache.hadoop.ipc.Client.wrapException(Client.java:766)
at org.apache.hadoop.ipc.Client.call(Client.java:742)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
at org.apache.hadoop.mapred.$Proxy4.heartbeat(Unknown Source)
at
org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:1215)
at
org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:1037)
at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1720)
at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2833)
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
at
org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404)
at
org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:304)
at
org.apache.hadoop.ipc.Client$Connection.access$1700(Client.java:176)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:859)
at org.apache.hadoop.ipc.Client.call(Client.java:719)
... 6 more
2009-10-17 18:07:22,111 INFO org.apache.hadoop.mapred.TaskTracker: Resending
'status' to 'server2' with reponseId '-4573
2009-10-17 18:07:23,112 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: server2/192.168.1.3:9001. Already tried 0 time(s).
2009-10-17 18:07:24,112 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: server2/192.168.1.3:9001. Already tried 1 time(s).
2009-10-17 18:07:25,113 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: server2/192.168.1.3:9001. Already tried 2 time(s).
2009-10-17 18:07:26,113 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: server2/192.168.1.3:9001. Already tried 3 time(s).
2009-10-17 18:07:27,114 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: server2/192.168.1.3:9001. Already tried 4 time(s).
2009-10-17 18:07:28,114 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: server2/192.168.1.3:9001. Already tried 5 time(s).
2009-10-17 18:07:29,115 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: server2/192.168.1.3:9001. Already tried 6 time(s).
2009-10-17 18:07:30,115 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: server2/192.168.1.3:9001. Already tried 7 time(s).
2009-10-17 18:07:31,116 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: server2/192.168.1.3:9001. Already tried 8 time(s).
2009-10-17 18:07:32,116 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: server2/192.168.1.3:9001. Already tried 9 time(s).
Log file: hadoop-root-tasktracker-server.log
2009-10-17 18:06:41,680 INFO org.apache.hadoop.mapred.TaskTracker: Received
'KillJobAction' for job: job_200910151507_2024
2009-10-17 18:06:41,680 WARN org.apache.hadoop.mapred.TaskTracker: Unknown
job job_200910151507_2024 being deleted.
2009-10-17 18:07:11,695 ERROR org.apache.hadoop.mapred.TaskTracker: Caught
exception: java.io.IOException: Call to server2/192.168.1.3:9001
failed on local exception: java.io.EOFException
at org.apache.hadoop.ipc.Client.wrapException(Client.java:774)
at org.apache.hadoop.ipc.Client.call(Client.java:742)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
at org.apache.hadoop.mapred.$Proxy4.heartbeat(Unknown Source)
at
org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:1215)
at
org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:1037)
at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1720)
at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2833)
Caused by: java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:375)
at
org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501)
at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446)
2009-10-17 18:07:11,695 INFO org.apache.hadoop.mapred.TaskTracker: Resending
'status' to 'server2' with reponseId '-4584
2009-10-17 18:07:12,696 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: server2/192.168.1.3:9001. Already tried 0 time(s).
2009-10-17 18:07:13,697 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: server2/192.168.1.3:9001. Already tried 1 time(s).
2009-10-17 18:07:14,698 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: server2/192.168.1.3:9001. Already tried 2 time(s).
2009-10-17 18:07:15,699 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: server2/192.168.1.3:9001. Already tried 3 time(s).
2009-10-17 18:07:16,699 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: server2/192.168.1.3:9001. Already tried 4 time(s).
2009-10-17 18:07:17,700 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: server2/192.168.1.3:9001. Already tried 5 time(s).
2009-10-17 18:07:18,701 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: server2/192.168.1.3:9001. Already tried 6 time(s).
2009-10-17 18:07:19,701 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: server2/192.168.1.3:9001. Already tried 7 time(s).
2009-10-17 18:07:20,702 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: server2/192.168.1.3:9001. Already tried 8 time(s).
2009-10-17 18:07:21,703 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: server2/192.168.1.3:9001. Already tried 9 time(s).
2009-10-17 18:07:21,704 ERROR org.apache.hadoop.mapred.TaskTracker: Caught
exception: java.net.ConnectException: Call to
server2/192.168.1.3:9001 failed on connection exception:
java.net.ConnectException: Connection refused
at org.apache.hadoop.ipc.Client.wrapException(Client.java:766)
at org.apache.hadoop.ipc.Client.call(Client.java:742)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
at org.apache.hadoop.mapred.$Proxy4.heartbeat(Unknown Source)
at
org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:1215)
at
org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:1037)
at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1720)
at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2833)
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
at
org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404)
at
org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:304)
at
org.apache.hadoop.ipc.Client$Connection.access$1700(Client.java:176)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:859)
at org.apache.hadoop.ipc.Client.call(Client.java:719)
... 6 more
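To make the two tasktracker logs above easier to compare, a small script can tally which endpoint is being retried and which root causes appear. This is only a rough sketch: the function name `summarize` and both regular expressions are my own assumptions based on the log lines shown here, not part of Hadoop. The patterns allow for the line wrapping between "Retrying connect" and "to server" seen in these logs.

```python
import re
from collections import Counter

# Patterns are assumptions derived from the log excerpts in this post.
# \s+ also matches newlines, so mail-wrapped entries are still found.
RETRY = re.compile(
    r"Retrying connect\s+to server:\s*(\S+:\d+)\.\s+Already tried (\d+) time"
)
CAUSE = re.compile(r"Caused by:\s+([\w.$]+)")


def summarize(log_text):
    """Count retry attempts per endpoint and root-cause exception classes."""
    retries = Counter(m.group(1) for m in RETRY.finditer(log_text))
    causes = Counter(m.group(1) for m in CAUSE.finditer(log_text))
    return retries, causes


if __name__ == "__main__":
    import sys

    # Usage: python summarize_log.py hadoop-root-tasktracker-server2.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        retries, causes = summarize(fh.read())
    print("retries per endpoint:", dict(retries))
    print("root causes:", dict(causes))
```

On both logs above, every retry targets the same endpoint, server2/192.168.1.3:9001, and the root causes are an EOFException followed by "Connection refused".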
This happens when I try to run a job.
r...@server2:/usr/local/hadoop-0.20.1# bin/hadoop jar ninvest-feedcrawler-0.3.0.jar com.nash.ninvest.backend.feed.Main /ninvest/feeds news news_checksum news_indexable 1
09/10/19 07:31:24 INFO ipc.Client: Retrying connect to server: server2/192.168.1.3:9001. Already tried 0 time(s).
09/10/19 07:31:25 INFO ipc.Client: Retrying connect to server: server2/192.168.1.3:9001. Already tried 1 time(s).
09/10/19 07:31:26 INFO ipc.Client: Retrying connect to server: server2/192.168.1.3:9001. Already tried 2 time(s).
09/10/19 07:31:27 INFO ipc.Client: Retrying connect to server: server2/192.168.1.3:9001. Already tried 3 time(s).
09/10/19 07:31:28 INFO ipc.Client: Retrying connect to server: server2/192.168.1.3:9001. Already tried 4 time(s).
09/10/19 07:31:29 INFO ipc.Client: Retrying connect to server: server2/192.168.1.3:9001. Already tried 5 time(s).
09/10/19 07:31:30 INFO ipc.Client: Retrying connect to server: server2/192.168.1.3:9001. Already tried 6 time(s).
09/10/19 07:31:31 INFO ipc.Client: Retrying connect to server: server2/192.168.1.3:9001. Already tried 7 time(s).
09/10/19 07:31:32 INFO ipc.Client: Retrying connect to server: server2/192.168.1.3:9001. Already tried 8 time(s).
09/10/19 07:31:33 INFO ipc.Client: Retrying connect to server: server2/192.168.1.3:9001. Already tried 9 time(s).
Exception in thread "main" java.net.ConnectException: Call to server2/192.168.1.3:9001 failed on connection exception: java.net.ConnectException: Connection refused
at org.apache.hadoop.ipc.Client.wrapException(Client.java:766)
at org.apache.hadoop.ipc.Client.call(Client.java:742)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
at org.apache.hadoop.mapred.$Proxy0.getProtocolVersion(Unknown Source)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359)
at org.apache.hadoop.mapred.JobClient.createRPCProxy(JobClient.java:429)
at org.apache.hadoop.mapred.JobClient.init(JobClient.java:423)
at org.apache.hadoop.mapred.JobClient.<init>(JobClient.java:410)
at org.apache.hadoop.mapreduce.Job.<init>(Job.java:50)
at com.nash.ninvest.backend.feed.Main.createSubmittableJob(Unknown Source)
at com.nash.ninvest.backend.feed.Main.run(Unknown Source)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at com.nash.ninvest.backend.feed.Main.main(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:304)
at org.apache.hadoop.ipc.Client$Connection.access$1700(Client.java:176)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:859)
at org.apache.hadoop.ipc.Client.call(Client.java:719)
... 16 more
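Since every failure above ends in "Connection refused" on server2/192.168.1.3:9001, one quick check is whether anything is listening on that port at all. The snippet below is a minimal sketch of a plain TCP connect probe; the function name `is_port_open` and the 2-second timeout are arbitrary choices of mine, and the host and port are simply the ones from the logs. It only tells you whether a TCP connection can be opened, not whether the service behind the port is healthy.

```python
import socket


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection resolves the host and attempts a TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    # Endpoint taken from the log lines in this post.
    host, port = "server2", 9001
    state = "open" if is_port_open(host, port) else "closed or unreachable"
    print(f"{host}:{port} is {state}")
```

If the probe reports the port as closed from both machines, the process that should be listening on 9001 is probably not running, and its own log would be the next place to look.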