The question probably sounds silly, but I'm seeing the following strange issue.
The namenode and datanodes start without any problem, and HDFS reports healthy.
But the tasktracker on the slaves cannot start. In the tasktracker log, I found
it keeps trying to talk to the namenode at A:50001, even though in
core-site.xml the namenode is configured as B:50001. Both A and B are IP
addresses of the namenode box; B is actually an IP alias for the loopback
interface on that box. So the datanodes are expected to send requests to
B:50001 but get answered from A, and that is fine; HDFS comes up. Now, to
start, the tasktracker apparently also needs to contact the namenode. But
somehow, rather than using B:50001, it uses A:50001, which I don't understand.
Where does the tasktracker get A? Is there a setting specifically for the
tasktracker to figure out the namenode's IP address and port? If it reads
core-site.xml, it should use B:50001 instead of A:50001. I am confused. Any
thoughts?
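One sanity check I can think of (just a sketch; the conf path below is an
assumption about the install layout) is to grep the whole conf directory on a
slave for anything that could hand the tasktracker an address, including a
leftover hadoop-site.xml from an older setup:

# Look for every config entry that names the namenode/jobtracker address.
# /path/to/hadoop/conf is an assumed location -- adjust to your install.
grep -rn "fs.default.name\|mapred.job.tracker\|50001" /path/to/hadoop/conf/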
Here's what is set in core-site.xml:
fs.default.name => hdfs://B:50001
Here's what is set in mapred-site.xml:
mapred.job.tracker => B:50002
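For completeness, this is how those values could be double-checked on a slave
box (paths are assumptions; -A1 works because <name> and <value> sit on
adjacent lines in the usual XML property layout):

grep -A1 "fs.default.name" /path/to/hadoop/conf/core-site.xml
grep -A1 "mapred.job.tracker" /path/to/hadoop/conf/mapred-site.xml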
On the slave boxes, B is seen as different from A, and the slaves can reach B
but not A (this is why the tasktracker cannot start when contacting the
namenode at A:50001; see the error message below).
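To be concrete about "can reach B but not A", I mean a port probe like the
following, run from a slave box. nc is just one way to test this, and -z only
checks whether the TCP port is reachable:

nc -z -w 5 B 50001   # connects
nc -z -w 5 A 50001   # times out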
Here is the relevant part of the tasktracker log:
...
2010-03-08 21:04:06,169 INFO org.apache.hadoop.ipc.Client: Retrying connect to
server: /A:50001. Already tried 44 time(s).
2010-03-08 21:04:26,170 ERROR org.apache.hadoop.mapred.TaskTracker: Caught
exception: java.net.SocketTimeoutException: Call to /A:50001 failed on socket
timeout exception: java.net.SocketTimeoutException: 20000 millis timeout while
waiting for channel to be ready for connect. ch :
java.nio.channels.SocketChannel[connection-pending remote=/A:50001]
at org.apache.hadoop.ipc.Client.wrapException(Client.java:771)
at org.apache.hadoop.ipc.Client.call(Client.java:743)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
at $Proxy5.getProtocolVersion(Unknown Source)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359)
at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:110)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:211)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:174)
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:82)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1448)
at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:67)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:1476)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1464)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:197)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175)
at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:1034)
at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1721)
at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2834)
Caused by: java.net.SocketTimeoutException: 20000 millis timeout while waiting
for channel to be ready for connect. ch :
java.nio.channels.SocketChannel[connection-pending remote=/A:50001]
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:213)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:407)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:304)
at org.apache.hadoop.ipc.Client$Connection.access$1700(Client.java:176)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:860)
at org.apache.hadoop.ipc.Client.call(Client.java:720)
... 16 more
2010-03-08 21:04:47,178 INFO org.apache.hadoop.ipc.Client: Retrying connect to
server: /A:50001. Already tried 0 time(s).
Thanks,
--
Michael