jiang licht wrote:
What are the exact packets and steps used to establish a namenode/datanode 
connection and  jobtracker/tasktracker connection?

I am asking this due to a weird problem related to starting datanodes and tasktrackers.
In my case, the namenode box has 2 ethernet interfaces combined as bond0 
interface with IP address of IP_A and there is an IP alias IP_B for local 
loopback interface as lo:1. All slave boxes sit on the same network segment as 
IP_B.

The network is configured such that no slave box can reach the namenode box at IP_A, but the namenode box can reach the slave 
boxes (clearly this traffic can only be routed via bond0). So, slave boxes always use "hdfs://IP_B:50001" as 
"fs.default.name" in "core-site.xml" and use "IP_B:50002" for the job tracker in 
"mapred-site.xml" to reach the namenode box.

There are the following 2 cases for how the namenode (or jobtracker) is configured on the 
namenode box.

Case #1: If I set "fs.default.name" to "hdfs://IP_B:50001", no slave box can join the cluster as a data node 
because the request to IP_B:50001 fails. "telnet IP_B 50001" on slave boxes results in connection refused. So, on the 
namenode box, I ran "tcpdump -i bond0 tcp port 50001", then from a slave box did a "telnet IP_B 50001" 
and watched for incoming and outgoing packets on the namenode box.

Case #2: If I set "fs.default.name" to "hdfs://IP_A:50001", slave boxes can 
join the cluster as data nodes. I again used tcpdump and telnet to watch the 
traffic. I compared these two cases and found some differences in the traffic. So, I want to know whether 
there is a handshake stage for the namenode and datanode to establish a connection, and what 
packets are exchanged for this purpose, so that I can figure out whether the packets exchanged in case #1 are correct or 
not, which may reveal why the connection request from data node to name node fails.

Also in Case #2, although all slave boxes can join the cluster as datanodes, no slave box 
can start as a tasktracker, because when a tasktracker starts up, the 
tasktracker box uses IP_A:50001 to request a connection to the namenode and, as mentioned above 
(slaves are not allowed to reach the namenode at IP_A but the reverse direction is ok), this 
cannot be done. My confusion here is that on all slave boxes 
"fs.default.name" is set to use IP_B:50001, so how did it end up 
contacting the namenode at IP_A:50001?

A bit complicated. But any thoughts?


The NN listens on the interface given by the IP address that its hostname resolves to; it does not like people connecting to it using a different hostname than the one it is on (irritating, something to fix).
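A quick way to see which address actually accepts connections is to script the "telnet" check and run it against each candidate address. The following is only a minimal sketch in Python: the helper names `can_connect` and `probe` are made up for illustration, and the addresses and port you pass in are whatever your own setup uses (IP_A, IP_B and 50001 in the question above).

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds.

    This is a scripted equivalent of "telnet host port": it only checks
    that something accepts the connection, not that it speaks Hadoop RPC.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable network and timeouts.
        return False


def probe(addresses, port, timeout=2.0):
    """Try the same port on several addresses; return {address: reachable}."""
    return {addr: can_connect(addr, port, timeout) for addr in addresses}
```

Run on a slave box, something like `probe(["IP_A", "IP_B"], 50001)` with the placeholders replaced by the real addresses shows at a glance which address is refused. A refused connection on one address while another works would be consistent with the listener being bound to a single address.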

It sounds like you have DNS problems. You should have a consistent mapping from hostname<-->IP address across the entire cluster, but the issues you describe indicate this may not be the case.
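One way to test that mapping is to resolve a hostname to an IP and then resolve the IP back, and compare the results on every box. The sketch below is an assumption about how you might script that check, not anything Hadoop ships; `check_mapping` is a hypothetical helper, and the resolver arguments exist only so the check can be exercised without real DNS.

```python
import socket


def check_mapping(hostname,
                  forward=socket.gethostbyname,
                  reverse=socket.gethostbyaddr):
    """Check that hostname -> IP -> hostname round-trips consistently.

    forward and reverse default to the system resolver; they can be
    swapped for stand-ins when testing.
    """
    ip = forward(hostname)
    # gethostbyaddr returns (primary_name, alias_list, ip_list).
    name, aliases, _ = reverse(ip)
    names = {n.lower() for n in [name, *aliases]}
    return {
        "hostname": hostname,
        "ip": ip,
        "reverse_names": sorted(names),
        "consistent": hostname.lower() in names,
    }
```

Running `check_mapping` for the namenode's hostname on the namenode box and on each slave box, then comparing the "ip" and "consistent" fields, should show whether any box sees a different mapping from the others.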
