What are the exact packets and steps used to establish a namenode/datanode 
connection and a jobtracker/tasktracker connection?

I am asking this due to a weird problem related to starting datanodes and 
tasktrackers. 

In my case, the namenode box has 2 ethernet interfaces combined as a bond0 
interface with IP address IP_A, and there is an IP alias IP_B on the local 
loopback interface as lo:1. All slave boxes sit on the same network segment as 
IP_B.

The network is configured such that no slave box can reach the namenode box at 
IP_A, but the namenode box can reach the slave boxes (clearly this traffic can 
only be routed from bond0). So, slave boxes always use "hdfs://IP_B:50001" as 
"fs.default.name" in "core-site.xml" and use "IP_B:50002" for the job tracker 
in "mapred-site.xml" to reach the namenode box.

There are the following 2 cases for how the namenode (or jobtracker) is 
configured on the namenode box.

Case #1: If I set "fs.default.name" to "hdfs://IP_B:50001", no slave box can 
join the cluster as a data node because the request to IP_B:50001 fails. 
"telnet IP_B 50001" on the slave boxes results in connection refused. So, on 
the namenode box, I ran "tcpdump -i bond0 tcp port 50001", then did a 
"telnet IP_B 50001" from a slave box and watched the incoming and outgoing 
packets on the namenode box.
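
The telnet check above can also be scripted so that it can be repeated from 
every slave box. This is only a minimal sketch, assuming Python 3 is available 
on the slaves; "probe" is a hypothetical helper name, not part of Hadoop. It 
separates "refused" (the connect attempt was actively rejected) from "timeout" 
(no answer came back at all), which telnet does not always make obvious.

```python
import socket

def probe(host, port, timeout=3.0):
    """Attempt a TCP connect to host:port and classify the outcome."""
    try:
        # create_connection performs the TCP connect and raises on failure
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # the connect attempt was rejected: nothing accepted it on that address/port
        return "refused"
    except socket.timeout:
        # no reply within the timeout, e.g. packets dropped along the way
        return "timeout"
    except OSError as exc:
        # anything else, e.g. no route to host
        return "error: %s" % exc

# Example call (replace the placeholder with the real address of IP_B):
#   print(probe("IP_B", 50001))
```

Running this against both IP_A and IP_B from a slave box, with tcpdump active 
on the namenode box, would show which of the two outcomes each case produces.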

Case #2: If I set "fs.default.name" to "hdfs://IP_A:50001", slave boxes can 
join the cluster as data nodes. I again used tcpdump and telnet to watch the 
traffic, compared the two cases, and found some differences in the traffic. 
So, I want to know whether there is a handshaking stage in which the namenode 
and datanode establish a connection, and which packets are exchanged for this 
purpose. With that I can figure out whether the packets exchanged in case #1 
are correct or not, which may reveal why the connection request from the data 
node to the name node fails.

Also in Case #2, although all slave boxes can join the cluster as datanodes, no 
slave box can start as a tasktracker. When a tasktracker starts, the 
tasktracker box uses IP_A:50001 to request a connection to the namenode, and as 
mentioned above this cannot be done (slaves are not allowed to reach the 
namenode at IP_A, although the reverse direction is ok). My confusion here is 
that on all slave boxes "fs.default.name" is set to use IP_B:50001, so how come 
it ended up contacting the namenode at IP_A:50001?
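
One thing that could be ruled out first is that the value actually present in 
the file on a slave differs from what is expected. A minimal sketch for that 
check, assuming Python 3 on the slave boxes; "read_property" and the SAMPLE 
text are hypothetical illustrations, and the path to the real "core-site.xml" 
depends on the installation.

```python
import xml.etree.ElementTree as ET

# Illustrative content only; on a real slave, read the text of core-site.xml instead.
SAMPLE = """<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://IP_B:50001</value>
  </property>
</configuration>"""

def read_property(xml_text, name):
    """Return the value of the named property, or None if it is not present."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        # each <property> holds a <name> and a <value> child element
        if prop.findtext("name", "").strip() == name:
            return prop.findtext("value", "").strip()
    return None

print(read_property(SAMPLE, "fs.default.name"))
```

This only reports what is written in the file that is passed in; it says 
nothing about which address the tasktracker obtains at runtime.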

A bit complicated. But any thoughts?

Thanks,

Michael


      
