Jeff Buell created HDFS-7122:
--------------------------------

             Summary: Very poor distribution of replication copies
                 Key: HDFS-7122
                 URL: https://issues.apache.org/jira/browse/HDFS-7122
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: namenode
    Affects Versions: 2.3.0
         Environment: Medium to large environments with hundreds to thousands 
of DNs will be most affected, but potentially all environments.
            Reporter: Jeff Buell


Summary:
Since HDFS-6268, the distribution of replica block copies across the DataNodes 
(replicas 2, 3, ..., as distinguished from the first "primary" replica) has been 
extremely poor, to the point that TeraGen slows down by as much as 3X for 
certain configurations.  This is almost certainly due to the introduction of 
ThreadLocalRandom in HDFS-6268.  The apparent mechanism is that this change 
causes the random numbers in the handler threads to be correlated, thus 
preventing a truly random choice of DN for each replica copy.
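To make the suspected mechanism concrete, here is a toy sketch in Python (not 
NameNode code): if every handler thread's generator is effectively seeded the 
same way, each thread walks the same pseudo-random sequence and the "random" 
DN picks collapse onto a handful of nodes.  The seed value and thread/pick 
counts below are illustrative assumptions, not measured values.

```python
import random
from collections import Counter

NUM_DNS = 638        # cluster size from the test above
PICKS_PER_THREAD = 6
NUM_THREADS = 64

# Correlated case: every "handler thread" starts from an identical seed
# (hypothetical illustration of the suspected correlation, not HDFS code).
correlated = Counter()
for _ in range(NUM_THREADS):
    rng = random.Random(12345)          # same seed in every thread
    for _ in range(PICKS_PER_THREAD):
        correlated[rng.randrange(NUM_DNS)] += 1

# Independent case: each thread gets genuinely independent random state.
independent = Counter()
for _ in range(NUM_THREADS):
    rng = random.Random()               # seeded from OS entropy
    for _ in range(PICKS_PER_THREAD):
        independent[rng.randrange(NUM_DNS)] += 1

print("distinct DNs hit (correlated):  ", len(correlated))
print("distinct DNs hit (independent): ", len(independent))
```

In the correlated case at most PICKS_PER_THREAD distinct DNs are ever chosen, 
no matter how many threads run; in the independent case the picks spread over 
hundreds of DNs, which is the behavior a replica placement policy needs.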

Testing details:
1 TB TeraGen on 638 slave nodes (virtual machines on 32 physical hosts), 256 MB 
block size.  This results in about 6 "primary" blocks on each DN.  With 
replication=3, there will be on average 12 more copies on each DN that are 
copies of blocks from other DNs.  Because of the random selection of DNs, 
exactly 12 copies per DN are not expected, but I found that about 160 DNs (1/4 
of all DNs!) received absolutely no copies, while one DN received over 100 
copies, and the elapsed time increased by about 3X relative to a pre-HDFS-6268 
distro.  There was no pattern to which DNs didn't receive copies, nor was the 
set of such DNs repeatable run-to-run.  In addition to the performance problem, 
there could be capacity problems due to one or a few DNs running out of space.  
Testing was done on CDH 5.0.0 (before) and CDH 5.1.2 (after), but I don't see a 
significant difference from the Apache Hadoop source in this regard.  The 
workaround to recover the previous behavior is to set 
dfs.namenode.handler.count=1, but of course this has scaling implications for 
large clusters.
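The expected per-DN counts above follow from simple arithmetic, which can be 
sanity-checked directly (pure arithmetic on the figures quoted above, no HDFS 
involved):

```python
# Expected block counts for the TeraGen run described above.
TERABYTE = 1024**4
BLOCK_SIZE = 256 * 1024**2
NUM_DNS = 638
REPLICATION = 3

primary_blocks = TERABYTE // BLOCK_SIZE        # 4096 primary blocks total
primary_per_dn = primary_blocks / NUM_DNS      # ~6.4 primaries per DN
copies = primary_blocks * (REPLICATION - 1)    # 8192 non-primary replicas
copies_per_dn = copies / NUM_DNS               # ~12.8 copies per DN

print(primary_blocks, round(primary_per_dn, 1), copies, round(copies_per_dn, 1))
```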

I recommend that the ThreadLocalRandom part of HDFS-6268 be reverted until a 
better algorithm can be implemented and tested.  Testing should include a case 
with many DNs and a small number of blocks on each.

It should also be noted that even pre-HDFS-6268, the random choice of DN 
algorithm produces a rather non-uniform distribution of copies.  This is not 
due to any bug, but purely a case of random distributions being much less 
uniform than one might intuitively expect.  In the above case, pre-HDFS-6268 
yields something like a range of 3 to 25 block copies per DN.  Surprisingly, 
the performance penalty of this non-uniformity is not as big as might be 
expected (maybe only 10-20%), but HDFS should do better, and in any case the 
capacity issue remains.  Round-robin choice of DN?  Better awareness of which 
DNs currently store fewer blocks?  It's not sufficient that the total number of 
blocks be similar on each DN at the end; at each point in time, no individual 
DN should receive a disproportionate number of blocks at once (which could be 
a danger of an RR algorithm).
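The 3-to-25 spread quoted above is consistent with the classic balls-into-bins 
behavior of uniform random placement.  A quick simulation (an illustrative 
sketch with an arbitrary fixed seed, not HDFS code) of ~12 copies per DN on 
average shows a similarly wide min/max range even with perfectly independent 
randomness:

```python
import random
from collections import Counter

NUM_DNS = 638
COPIES = NUM_DNS * 12   # ~12 replica copies per DN on average

rng = random.Random(1)  # fixed seed purely for reproducibility of the sketch
counts = Counter(rng.randrange(NUM_DNS) for _ in range(COPIES))
per_dn = [counts.get(dn, 0) for dn in range(NUM_DNS)]

print("min copies on a DN:", min(per_dn))
print("max copies on a DN:", max(per_dn))
```

So the wide spread pre-HDFS-6268 is expected from pure chance; only the 
post-HDFS-6268 behavior (a quarter of DNs receiving zero copies) is anomalous.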

This JIRA should probably be limited to tracking the ThreadLocalRandom issue, 
with the random-choice non-uniformity tracked in a separate one.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
