Jeff Buell created HDFS-7122:
--------------------------------
Summary: Very poor distribution of replication copies
Key: HDFS-7122
URL: https://issues.apache.org/jira/browse/HDFS-7122
Project: Hadoop HDFS
Issue Type: Bug
Components: namenode
Affects Versions: 2.3.0
Environment: medium-large environments with 100's to 1000's of DNs
will be most affected, but potentially all environments.
Reporter: Jeff Buell
Summary:
Since HDFS-6268, the distribution of replica block copies across the DataNodes
(replicas 2,3,... as distinguished from the first "primary" replica) is
extremely poor, to the point that TeraGen slows down by as much as 3X for
certain configurations. This is almost certainly due to the introduction of
Thread Local Random in HDFS-6268. The mechanism appears to be that this change
causes all the random numbers in the threads to be correlated, thus preventing
a truly random choice of DN for each replica copy.
Testing details:
1 TB TeraGen on 638 slave nodes (virtual machines on 32 physical hosts), 256MB
block size. This results in 6 "primary" blocks on each DN. With
replication=3, there will be on average 12 more copies on each DN that are
copies of blocks from other DNs. Because of the random selection of DNs,
exactly 12 copies are not expected, but I found that about 160 DNs (1/4 of all
DNs!) received absolutely no copies, while one DN received over 100 copies, and
the elapsed time increased by about 3X from a pre-HDFS-6268 distro. There was
no pattern to which DNs didn't receive copies, nor was the set of such DNs
repeatable run-to-run. In addition to the performance problem, there could be
capacity problems due to one or a few DNs running out of space. Testing was
done on CDH 5.0.0 (before) and CDH 5.1.2 (after), but I don't see a significant
difference from the Apache Hadoop source in this regard. The workaround to
recover the previous behavior is to set dfs.namenode.handler.count=1 but of
course this has scaling implications for large clusters.
I recommend that the ThreadLocal Random part of HDFS-6268 be reverted until a
better algorithm can be implemented and tested. Testing should include a case
with many DNs and a small number of blocks on each.
It should also be noted that even pre-HDFS-6268, the random choice of DN
algorithm produces a rather non-uniform distribution of copies. This is not
due to any bug, but purely a case of random distributions being much less
uniform than one might intuitively expect. In the above case, pre-HDFS-6268
yields something like a range of 3 to 25 block copies on each DN. Surprisingly,
the performance penalty of this non-uniformity is not as big as might be
expected (maybe only 10-20%), but HDFS should do better, and in any case the
capacity issue remains. Round-robin choice of DN? Better awareness of which
DNs currently store fewer blocks? It's not sufficient that the total number of
blocks is similar on each DN at the end, but that at each point in time no
individual DN receives a disproportionate number of blocks at once (which could
be a danger of a RR algorithm).
Probably should limit this jira to tracking the ThreadLocal issue, and track
the random choice issue in another one.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)