[ https://issues.apache.org/jira/browse/HDFS-17620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Junegunn Choi updated HDFS-17620:
---------------------------------
Description:

h2. Problem description

If an erasure-coded file is not large enough to fill the stripe width of the EC policy, the block distribution can be suboptimal.

For example, an RS-6-3-1024K EC file smaller than 1024K will have 1 data block and 3 parity blocks. While all 9 (6 + 3) storage locations are chosen by the block placement policy, only 4 of them are used, and the last 3 locations are for the parity blocks. If the cluster has a very small number of racks (e.g. 3), the current scheme of finding a pipeline with the shortest path makes the last nodes likely to be in the same rack, resulting in a suboptimal rack distribution.

{noformat}
Locations: N1 N2 N3 N4 N5 N6 N7 N8 N9
Racks:     R1 R1 R1 R2 R2 R2 R3 R3 R3
Blocks:    D1                P1 P2 P3
{noformat}

We can see that the blocks are stored in only 2 racks, not 3.

Because the block group does not span enough racks, an {{ErasureCodingWork}} will later be created to replicate a block to a new rack. However, the current code tries to copy the block to the first node in the chosen locations, regardless of its rack, so the copy is not guaranteed to improve the situation, and we constantly see {{PendingReconstructionMonitor timed out}} messages in the log.

h2. Proposed solution

1. Reorder the chosen locations by rack so that the parity blocks are stored in as many racks as possible (a sketch follows this description).
2. Make {{ErasureCodingWork}} try to find a target on a new rack (a second sketch follows the issue summary below).

h2. Real-world test result

We first noticed the problem on our HBase cluster running Hadoop 3.3.6 on 18 nodes across 3 racks. After setting the RS-6-3-1024K policy on the HBase data directory, we noticed that

1. FSCK reports "Unsatisfactory placement block groups" for small EC files.
{noformat}
/hbase/***: Replica placement policy is violated for ***. Block should be additionally replicated on 2 more rack(s). Total number of racks in the cluster: 3
...
Erasure Coded Block Groups:
...
 Unsatisfactory placement block groups: 1475 (2.5252092 %)
{noformat}
2. The Namenode keeps logging "PendingReconstructionMonitor timed out" messages every recheck-interval (5 minutes).
3. The {{FSNamesystem.UnderReplicatedBlocks}} metric bumps and clears every recheck-interval.

After applying the patch, all the problems are gone. "Unsatisfactory placement block groups" is now zero, and there are no more metric bumps or "timed out" logs.

!screenshot-1.png|width=500!
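A minimal sketch of idea 1, for illustration only. This is not the actual patch; {{Location}} and {{reorderByRack}} are made-up stand-ins for HDFS's internal types (e.g. {{DatanodeStorageInfo}}). The idea: group the chosen locations by rack and emit them round-robin, so that the slots actually used by a small block group spread over as many racks as possible.

{code:java}
// Hypothetical sketch (not the HDFS-17620 patch): round-robin the chosen
// locations across racks before assigning block indices.
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RackRoundRobin {

  /** Made-up stand-in for a chosen storage location: node name plus rack. */
  static final class Location {
    final String node;
    final String rack;
    Location(String node, String rack) { this.node = node; this.rack = rack; }
    @Override public String toString() { return node + "(" + rack + ")"; }
  }

  /**
   * Group the chosen locations by rack (keeping the original order within
   * each rack) and emit them round-robin across the racks.
   */
  static List<Location> reorderByRack(List<Location> chosen) {
    Map<String, List<Location>> byRack = new LinkedHashMap<>();
    for (Location loc : chosen) {
      byRack.computeIfAbsent(loc.rack, r -> new ArrayList<>()).add(loc);
    }
    List<Location> reordered = new ArrayList<>(chosen.size());
    for (int round = 0; reordered.size() < chosen.size(); round++) {
      for (List<Location> perRack : byRack.values()) {
        if (round < perRack.size()) {
          reordered.add(perRack.get(round));
        }
      }
    }
    return reordered;
  }

  public static void main(String[] args) {
    List<Location> chosen = List.of(
        new Location("N1", "R1"), new Location("N2", "R1"), new Location("N3", "R1"),
        new Location("N4", "R2"), new Location("N5", "R2"), new Location("N6", "R2"),
        new Location("N7", "R3"), new Location("N8", "R3"), new Location("N9", "R3"));
    // Reordered: N1(R1) N4(R2) N7(R3) N2(R1) N5(R2) N8(R3) N3(R1) N6(R2) N9(R3).
    // A sub-stripe RS-6-3 file uses slot 0 (data) and slots 6..8 (parity);
    // with this order they map to racks R1, R1, R2, R3 (3 racks) instead of
    // R1, R3, R3, R3 (2 racks) as in the diagram above.
    System.out.println(reorderByRack(chosen));
  }
}
{code}

A round-robin order has the nice property that any contiguous run of slots is about as rack-diverse as it can be, so the improvement does not depend on exactly which slots a small file ends up using.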
> Better block placement for small EC files
> -----------------------------------------
>
>                 Key: HDFS-17620
>                 URL: https://issues.apache.org/jira/browse/HDFS-17620
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: erasure-coding, namenode
>    Affects Versions: 3.3.6
>            Reporter: Junegunn Choi
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: screenshot-1.png
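And a similarly hedged sketch of idea 2. Again hypothetical: {{Node}} and {{chooseTarget}} are illustrative stand-ins, not the real {{ErasureCodingWork}} API. When the block group is missing a rack, prefer a reconstruction target whose rack holds no block of the group yet, rather than blindly using the first chosen location.

{code:java}
// Hypothetical sketch (not the actual ErasureCodingWork change): pick a
// reconstruction target on a rack that does not yet store the block group.
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class NewRackTargetChooser {

  /** Made-up stand-in for a DataNode descriptor: node name plus rack. */
  static final class Node {
    final String name;
    final String rack;
    Node(String name, String rack) { this.name = name; this.rack = rack; }
    @Override public String toString() { return name + "(" + rack + ")"; }
  }

  /**
   * Prefer the first candidate whose rack stores no block of the group yet;
   * fall back to the first candidate if every candidate rack is already used.
   */
  static Node chooseTarget(List<Node> candidates, List<Node> currentHolders) {
    Set<String> usedRacks = new HashSet<>();
    for (Node holder : currentHolders) {
      usedRacks.add(holder.rack);
    }
    for (Node candidate : candidates) {
      if (!usedRacks.contains(candidate.rack)) {
        return candidate;  // copying here adds a new rack to the block group
      }
    }
    return candidates.isEmpty() ? null : candidates.get(0);  // best effort
  }

  public static void main(String[] args) {
    // Block group from the description: D1 on R1, P1..P3 all on R3.
    List<Node> holders = List.of(
        new Node("N1", "R1"), new Node("N7", "R3"),
        new Node("N8", "R3"), new Node("N9", "R3"));
    List<Node> candidates = List.of(
        new Node("N2", "R1"), new Node("N4", "R2"), new Node("N5", "R2"));
    // Prints N4(R2): the copy lands on the missing rack, so the reported
    // placement violation can actually be cleared.
    System.out.println(chooseTarget(candidates, holders));
  }
}
{code}

In the unpatched behavior described above, the copy could just as well land on a node that adds no new rack, which is why the pending reconstruction never makes progress and the "PendingReconstructionMonitor timed out" messages recur every recheck-interval.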