Better Target selection for block replication
---------------------------------------------

                 Key: HADOOP-1530
                 URL: https://issues.apache.org/jira/browse/HADOOP-1530
             Project: Hadoop
          Issue Type: Improvement
          Components: dfs
    Affects Versions: 0.14.0
            Reporter: Enis Soztutar
             Fix For: 0.14.0


The block replication policy tends to balance the number of blocks on each 
datanode in the long run. However, in heterogeneous clusters where the number 
of disks per node varies, nodes with one disk fill up quickly while nodes with 
3 disks still have 60% free disk space. This also reduces the advantage of 
using more than one disk for parallel IO, since machines with multiple disks 
are not used as much.

The javadoc of ReplicationTargetChooser reads:
The replica placement strategy is that if the writer is on a datanode, the 1st 
replica is placed on the local machine, otherwise a random datanode. The 2nd 
replica is placed on a datanode that is on a different rack. The 3rd replica is 
placed on a datanode which is on the same rack as the first replica.

I think we should switch to a policy that balances the percentage of disk 
space used rather than the total block count among the datanodes. This can be 
done by making the probability of selecting a datanode depend on its percent 
disk usage. A formula like 1 - (percent_usage / 100) seems reasonable.
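
As a rough illustration of the idea, here is a minimal sketch of weighted 
random selection using that formula. The class and method names 
(UsageWeightedChooser, weight, chooseIndex) are hypothetical and are not part 
of ReplicationTargetChooser; the sketch also assumes the candidate list has 
already been narrowed by whatever rack constraints apply, which this issue 
does not specify.

```java
import java.util.Random;

/** Hypothetical illustration only; not Hadoop's ReplicationTargetChooser. */
final class UsageWeightedChooser {

    /** Selection weight from the proposed formula: 1 - (percent_usage / 100). */
    static double weight(double percentUsage) {
        double w = 1.0 - percentUsage / 100.0;
        // Clamp to [0, 1] so out-of-range reports cannot produce odd weights.
        return Math.max(0.0, Math.min(1.0, w));
    }

    /**
     * Picks an index into percentUsage with probability proportional to
     * weight(percentUsage[i]). Returns -1 if every candidate is full
     * (or the array is empty).
     */
    static int chooseIndex(double[] percentUsage, Random rng) {
        double total = 0.0;
        for (double u : percentUsage) {
            total += weight(u);
        }
        if (total <= 0.0) {
            return -1;
        }
        // Draw a point in [0, total) and walk the cumulative weights.
        double r = rng.nextDouble() * total;
        int lastEligible = -1;
        for (int i = 0; i < percentUsage.length; i++) {
            double w = weight(percentUsage[i]);
            if (w <= 0.0) {
                continue; // a full node is never selected
            }
            lastEligible = i;
            r -= w;
            if (r < 0.0) {
                return i;
            }
        }
        // Only reached through floating-point rounding; fall back safely.
        return lastEligible;
    }
}
```

With two candidates at 90% and 40% usage, the weights are 0.1 and 0.6, so the 
emptier node would be picked in roughly 6 out of 7 draws. A node at 100% usage 
gets weight 0 and is never chosen.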

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.