[ https://issues.apache.org/jira/browse/HDFS-17620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17880475#comment-17880475 ]
ASF GitHub Bot commented on HDFS-17620:
---------------------------------------

junegunn opened a new pull request, #7035:
URL: https://github.com/apache/hadoop/pull/7035

   ### Description of PR

   Fixes block placement for small EC files. A detailed description is in the JIRA issue.

   ### How was this patch tested?

   Tested on our production HBase cluster running Hadoop 3.3.6.

   ### For code changes:

   - [x] Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
   - [ ] Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
   - [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)?
   - [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, `NOTICE-binary` files?


> Better block placement for small EC files
> ------------------------------------------
>
>                 Key: HDFS-17620
>                 URL: https://issues.apache.org/jira/browse/HDFS-17620
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: erasure-coding, namenode
>    Affects Versions: 3.3.6
>            Reporter: Junegunn Choi
>            Priority: Major
>
> h2. Problem description
>
> If an erasure-coded file is not large enough to fill the stripe width of the
> EC policy, the block distribution can be suboptimal.
>
> For example, an RS-6-3-1024K EC file smaller than 1024K has 1 data block and
> 3 parity blocks. Although all 9 (6 + 3) storage locations are chosen by the
> block placement policy, only 4 of them are used: the first location for the
> data block and the last 3 locations for the parity blocks. If the cluster has
> a very small number of racks (e.g. 3), then with the current scheme of
> finding a pipeline with the shortest path, the last nodes are likely to be in
> the same rack, resulting in a suboptimal rack distribution.
> {noformat}
> Locations: N1 N2 N3 N4 N5 N6 N7 N8 N9
> Racks:     R1 R1 R1 R2 R2 R2 R3 R3 R3
> Blocks:    D1                P1 P2 P3
> {noformat}
> We can see that the blocks are stored in only 2 racks, not 3.
>
> Because the block group does not span enough racks, an {{ErasureCodingWork}}
> is later created to replicate a block to a new rack. However, the current
> code tries to copy the block to the first node in the chosen locations,
> regardless of its rack, so the copy is not guaranteed to improve the
> situation, and we constantly see {{PendingReconstructionMonitor timed out}}
> messages in the log.
>
> h2. Proposed solution
>
> 1. Reorder the chosen locations by rack so that the parity blocks are stored
> in as many racks as possible (a rough sketch of this idea follows below).
> 2. Make {{ErasureCodingWork}} try to find a target on a new rack (a second
> sketch at the end of this message illustrates this).
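To make item 1 concrete, here is a minimal, self-contained sketch of rack-aware reordering. It is not the actual patch: plain strings stand in for Hadoop's internal location types, and the `reorderByRack` helper is hypothetical. It only illustrates how interleaving the chosen locations round-robin across racks lets the block indices a small EC file actually uses land on more racks.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, illustrative sketch; not Hadoop's BlockPlacementPolicy code.
public class RackAwareReorder {

  // Interleave the chosen nodes round-robin across racks so that consecutive
  // location indices fall on different racks where possible.
  static List<String> reorderByRack(List<String> nodes, Map<String, String> rackOf) {
    Map<String, Deque<String>> byRack = new LinkedHashMap<>();
    for (String node : nodes) {
      byRack.computeIfAbsent(rackOf.get(node), r -> new ArrayDeque<>()).add(node);
    }
    List<String> reordered = new ArrayList<>(nodes.size());
    while (reordered.size() < nodes.size()) {
      for (Deque<String> rackQueue : byRack.values()) {
        if (!rackQueue.isEmpty()) {
          reordered.add(rackQueue.pollFirst());
        }
      }
    }
    return reordered;
  }

  public static void main(String[] args) {
    // Same layout as the diagram in the issue: N1-N3 on R1, N4-N6 on R2, N7-N9 on R3.
    List<String> nodes = List.of("N1", "N2", "N3", "N4", "N5", "N6", "N7", "N8", "N9");
    Map<String, String> rackOf = new HashMap<>();
    for (int i = 0; i < nodes.size(); i++) {
      rackOf.put(nodes.get(i), "R" + (i / 3 + 1));
    }
    // A file smaller than one cell uses indices 0 (data) and 6..8 (parity).
    // Before reordering those indices hold N1, N7, N8, N9 -> racks R1 and R3 only.
    // After reordering they hold N1, N3, N6, N9 -> all three racks are covered.
    System.out.println(reorderByRack(nodes, rackOf));
    // [N1, N4, N7, N2, N5, N8, N3, N6, N9]
  }
}
```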
>
> h2. Real-world test result
>
> We first noticed the problem on our HBase cluster running Hadoop 3.3.6 on 18
> nodes across 3 racks. After setting the RS-6-3-1024K policy on the HBase data
> directory, we noticed that:
> 1. FSCK reports "Unsatisfactory placement block groups" for small EC files.
> {noformat}
> /hbase/***: Replica placement policy is violated for ***. Block should be
> additionally replicated on 2 more rack(s). Total number of racks in the
> cluster: 3
> ...
> Erasure Coded Block Groups:
> ...
> Unsatisfactory placement block groups: 1475 (2.5252092 %)
> {noformat}
> 2. The namenode keeps logging "PendingReconstructionMonitor timed out"
> messages every recheck interval (5 minutes).
> 3. The {{FSNamesystem.UnderReplicatedBlocks}} metric bumps up and clears
> every recheck interval.
>
> After applying the patch, all the problems are gone. "Unsatisfactory
> placement block groups" is now zero, and there are no more metric bumps or
> "timed out" logs.
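For item 2 of the proposed solution, a similarly hedged sketch: the names and types below are hypothetical and do not reflect {{ErasureCodingWork}}'s actual internals. It only illustrates the idea of preferring a reconstruction target whose rack does not already hold part of the block group, instead of copying to the first chosen location.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Hypothetical, illustrative sketch only; the real change would live in
// Hadoop's ErasureCodingWork, which works with datanode descriptors, not strings.
public class NewRackTargetChooser {

  // Prefer a reconstruction target whose rack does not already hold part of
  // the block group; only such a copy actually fixes the rack-count violation.
  static Optional<String> chooseTargetOnNewRack(List<String> candidates,
                                                Set<String> racksWithReplica,
                                                Map<String, String> rackOf) {
    return candidates.stream()
        .filter(node -> !racksWithReplica.contains(rackOf.get(node)))
        .findFirst();
  }

  public static void main(String[] args) {
    Map<String, String> rackOf = Map.of("N2", "R1", "N5", "R2", "N8", "R3");
    // The small EC block group from the issue lives on racks R1 and R3 only.
    Set<String> racksWithReplica = Set.of("R1", "R3");
    // N5 on R2 is picked, so the extra copy adds a third rack; copying to N2 or N8
    // (the behavior described in the issue) would leave the violation in place.
    System.out.println(chooseTargetOnNewRack(
        List.of("N2", "N8", "N5"), racksWithReplica, rackOf));  // Optional[N5]
  }
}
```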