This question probably sounds silly, but I have run into the following issue and can't explain it.

The namenode and datanodes start without any problem and HDFS reports healthy.

But the tasktracker on the slaves cannot start. In the tasktracker log, I found that it keeps 
trying to talk to the namenode at A:50001, while in core-site.xml the namenode setting is 
B:50001. Both A and B are IP addresses of the namenode box; B is actually an IP alias for 
loopback on that box. So the datanodes are expected to send requests to B:50001 and be 
answered from A:50001, and that works fine: HDFS comes up. Now, to start, the tasktracker 
apparently also needs to contact the namenode, but somehow it uses A:50001 rather than 
B:50001, which I don't understand. Where does the tasktracker get A? Is there a setting 
specifically for the tasktracker to figure out the namenode's IP address and port? If it 
reads core-site.xml, it should use B:50001 instead of A:50001. I am confused. Any thoughts?

Here's what is set in core-site.xml

dfs.default.name=>hdfs://B:50001
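For reference, inside the actual core-site.xml that setting would be a property block roughly 
like the one below (property name copied exactly as written above; note that stock Hadoop 0.20 
reads the filesystem URI from fs.default.name, so the name is worth double-checking):

    <configuration>
      <!-- Filesystem URI that the HDFS clients, including the tasktracker, read at startup.
           Property name taken from the post as written; stock Hadoop expects fs.default.name. -->
      <property>
        <name>dfs.default.name</name>
        <value>hdfs://B:50001</value>
      </property>
    </configuration>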

Here's what is set in mapred-site.xml

mapred.job.tracker=>B:50002
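And the corresponding mapred-site.xml entry would look roughly like this (again, a sketch of 
the XML form of the value quoted above):

    <configuration>
      <!-- Address the tasktrackers use to reach the jobtracker (not the namenode). -->
      <property>
        <name>mapred.job.tracker</name>
        <value>B:50002</value>
      </property>
    </configuration>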

On the slave boxes, B and A are seen as different addresses, and the slaves can reach B but 
not A (which is why the tasktracker cannot start when it contacts the namenode at A:50001; 
see the error message below).

Here is the relevant portion of the tasktracker log:

...
2010-03-08 21:04:06,169 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: /A:50001. Already tried 44 time(s).
2010-03-08 21:04:26,170 ERROR org.apache.hadoop.mapred.TaskTracker: Caught 
exception: java.net.SocketTimeoutException: Call to /A:50001 failed on socket 
timeout exception: java.net.SocketTimeoutException: 20000 millis timeout while 
waiting for channel to be ready for connect. ch : 
java.nio.channels.SocketChannel[connection-pending remote=/A:50001]
    at org.apache.hadoop.ipc.Client.wrapException(Client.java:771)
    at org.apache.hadoop.ipc.Client.call(Client.java:743)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
    at $Proxy5.getProtocolVersion(Unknown Source)
    at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359)
    at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:110)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:211)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:174)
    at 
org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:82)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1448)
    at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:67)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:1476)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1464)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:197)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175)
    at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:1034)
    at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1721)
    at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2834)
Caused by: java.net.SocketTimeoutException: 20000 millis timeout while waiting 
for channel to be ready for connect. ch : 
java.nio.channels.SocketChannel[connection-pending remote=/A:50001]
    at 
org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:213)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:407)
    at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:304)
    at org.apache.hadoop.ipc.Client$Connection.access$1700(Client.java:176)
    at org.apache.hadoop.ipc.Client.getConnection(Client.java:860)
    at org.apache.hadoop.ipc.Client.call(Client.java:720)
    ... 16 more

2010-03-08 21:04:47,178 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: /A:50001. Already tried 0 time(s).

Thanks,
--

Michael


      
