Sangmin Lee wrote:
Hi folks,

I have a question regarding HDFS's load balancing when it chooses target
datanodes for a block.
From the code, it seems to make the decision based on information from
previous heartbeats.
Since heartbeats come every 3 seconds, within that window we may end up
putting more load on some datanodes than others.
I noticed that for disk space balancing, the namenode maintains scheduled block
information for each datanode, which is updated whenever a new block is
assigned to that datanode.
Shouldn't we do a similar thing for traffic?

We should. HADOOP-3707 was meant for a dot release, so we didn't want to depend on the new stat too much at that time. The comments in the JIRA and in the code say so.

Unless you have a long heartbeat interval, do you really think it makes much of a difference in the normal case? We would like to know if you have seen any such cases.

It could help if there are a large number of clients simultaneously writing from a small set of nodes.
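
To make the idea from the question concrete, here is a rough sketch of counting the load reported by the last heartbeat plus whatever has been scheduled on a datanode since then. This is not the actual namenode code: the class PendingLoadTracker, its methods, and the notion of "active transfers" are hypothetical names made up for illustration, and resetting the pending count on each heartbeat is a simplifying assumption.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch, not the real NameNode implementation.
 * Effective load = transfers reported by the last heartbeat
 *                + blocks scheduled on the node since that heartbeat.
 */
class PendingLoadTracker {

    /** Per-datanode counters (illustrative names). */
    private static final class NodeLoad {
        int reportedTransfers;        // value carried by the last heartbeat
        int scheduledSinceHeartbeat;  // assignments made after that heartbeat
    }

    private final Map<String, NodeLoad> loads = new HashMap<>();

    /** Record a heartbeat from a datanode. */
    synchronized void onHeartbeat(String node, int activeTransfers) {
        NodeLoad load = loads.computeIfAbsent(node, k -> new NodeLoad());
        load.reportedTransfers = activeTransfers;
        // Simplifying assumption: the reported number already includes
        // everything scheduled earlier, so the pending count restarts at zero.
        load.scheduledSinceHeartbeat = 0;
    }

    /** Load used for placement decisions; unknown nodes count as idle. */
    synchronized int effectiveLoad(String node) {
        NodeLoad load = loads.get(node);
        return load == null ? 0 : load.reportedTransfers + load.scheduledSinceHeartbeat;
    }

    /**
     * Pick the candidate with the lowest effective load (ties go to the
     * earliest candidate) and immediately count the new assignment against it.
     * Returns null if there are no candidates.
     */
    synchronized String chooseTarget(Iterable<String> candidates) {
        String best = null;
        int bestLoad = Integer.MAX_VALUE;
        for (String node : candidates) {
            int load = effectiveLoad(node);
            if (load < bestLoad) {
                best = node;
                bestLoad = load;
            }
        }
        if (best != null) {
            loads.computeIfAbsent(best, k -> new NodeLoad()).scheduledSinceHeartbeat++;
        }
        return best;
    }
}
```

With two equally loaded datanodes, two back-to-back calls to chooseTarget within one heartbeat window pick different nodes, instead of both landing on the same one because the heartbeat data is stale.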

Based on discussions here at Yahoo, this area of NN scheduling will undergo some improvements in the near future, especially to handle clusters with heterogeneous datanodes.

Raghu.

Thanks,
Sangmin Lee

