[ https://issues.apache.org/jira/browse/HDFS-10967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zhe Zhang updated HDFS-10967:
-----------------------------
    Attachment: HDFS-10967.00.patch

Thought more about the design options. Currently I think the most reasonable direction is to:
# Apply the logic only in {{chooseRemoteRack}} -- that is, when the 2nd and 3rd replicas are being placed. In most cases, steering the 2nd and 3rd replicas away from near-full DNs is already sufficient to address the balancing issue.
# Use the percentage of remaining capacity, because current NN metrics are already based on percentages, so admins should be most comfortable operating on that basis (e.g. {{97%}} full).
# Retry random selection a few times instead of completely avoiding near-full DNs. As mentioned above, imbalance causes an issue only if a large number of DNs are near full, so a statistical solution should be sufficient.

> Add configuration for BlockPlacementPolicy to avoid near-full DataNodes
> -----------------------------------------------------------------------
>
>                 Key: HDFS-10967
>                 URL: https://issues.apache.org/jira/browse/HDFS-10967
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>            Reporter: Zhe Zhang
>            Assignee: Zhe Zhang
>              Labels: balancer
>         Attachments: HDFS-10967.00.patch, HDFS-10967.poc.patch
>
>
> Large production clusters are likely to have heterogeneous nodes in terms of
> storage capacity, memory, and CPU cores. It is not always possible to
> proportionally ingest data into DataNodes based on their remaining storage
> capacity. Therefore it's possible for a subset of DataNodes to be much closer
> to full capacity than the rest.
> This heterogeneity is most likely rack-by-rack -- i.e. _m_ whole racks of
> low-storage nodes and _n_ whole racks of high-storage nodes. So it'd be very
> useful if we could lower the chance for those near-full DataNodes to become
> destinations for the 2nd and 3rd replicas.
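
For illustration only, the direction in the comment above (re-draw a random remote target a few times when the pick is below a remaining-capacity percentage) could be sketched roughly as follows. This is not the code in HDFS-10967.00.patch: the class {{NearFullAvoidanceSketch}}, the {{Node}} type, the {{chooseRemote}} method, and the parameters {{minRemainingPercent}} and {{maxAttempts}} are all hypothetical names made up for this sketch, and it does not use any real {{BlockPlacementPolicy}} API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical sketch; none of these names come from the HDFS-10967 patch.
final class NearFullAvoidanceSketch {

    /** Minimal stand-in for a DataNode's storage report (assumed shape). */
    static final class Node {
        final String name;
        final long capacityBytes;
        final long remainingBytes;

        Node(String name, long capacityBytes, long remainingBytes) {
            this.name = name;
            this.capacityBytes = capacityBytes;
            this.remainingBytes = remainingBytes;
        }

        /** Remaining capacity as a percentage of total capacity. */
        double remainingPercent() {
            return capacityBytes == 0 ? 0.0 : 100.0 * remainingBytes / capacityBytes;
        }
    }

    /**
     * Draws a random candidate, re-drawing up to maxAttempts times while the
     * pick has less than minRemainingPercent free. If every draw lands on a
     * near-full node, the last draw is returned, so near-full nodes become
     * less likely targets but are never excluded outright.
     */
    static Node chooseRemote(List<Node> candidates, double minRemainingPercent,
                             int maxAttempts, Random rng) {
        if (candidates.isEmpty() || maxAttempts < 1) {
            throw new IllegalArgumentException("need candidates and at least one attempt");
        }
        Node pick = null;
        for (int i = 0; i < maxAttempts; i++) {
            pick = candidates.get(rng.nextInt(candidates.size()));
            if (pick.remainingPercent() >= minRemainingPercent) {
                return pick; // enough free space: accept this draw
            }
        }
        return pick; // all draws were near-full: fall back to the last one
    }

    public static void main(String[] args) {
        List<Node> nodes = new ArrayList<>();
        nodes.add(new Node("dn-small", 100, 2));  // 98% full
        nodes.add(new Node("dn-large-1", 100, 60));
        nodes.add(new Node("dn-large-2", 100, 70));

        Random rng = new Random();
        int trials = 10_000;
        int nearFullPicks = 0;
        for (int i = 0; i < trials; i++) {
            // 3.0 corresponds to the "97% full" example; 3 attempts is arbitrary.
            if (chooseRemote(nodes, 3.0, 3, rng).name.equals("dn-small")) {
                nearFullPicks++;
            }
        }
        System.out.println("near-full node chosen in " + nearFullPicks
                + " of " + trials + " placements");
    }
}
```

With independent uniform draws, if a fraction p of the candidates is near full and up to k draws are made, a near-full node is returned with probability p^k. That is why a handful of retries is enough when only a few DNs are near full, and why the approach degrades gracefully (rather than failing placement) when many are.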