[ 
https://issues.apache.org/jira/browse/HDFS-17620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junegunn Choi updated HDFS-17620:
---------------------------------
    Description: 
h2. Problem description

If an erasure-coded file is not large enough to fill the stripe width of the EC 
policy, the block distribution can be suboptimal.

For example, an RS-6-3-1024K EC file smaller than 1024K has 1 data block 
and 3 parity blocks. Although the block placement policy chooses all 9 (6 + 3) 
storage locations, only 4 of them are used: the first location for the data 
block and the last 3 for the parity blocks. If the cluster has very few racks 
(e.g. 3), the current scheme of finding a pipeline with the shortest path makes 
the last nodes likely to land in the same rack, resulting in a suboptimal rack 
distribution.
{noformat}
Locations: N1 N2 N3 N4 N5 N6 N7 N8 N9
    Racks: R1 R1 R1 R2 R2 R2 R3 R3 R3
   Blocks: D1                P1 P2 P3
{noformat}
We can see that blocks are stored in only 2 racks, not 3.
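The layout above can be sketched as follows (illustrative names only, not actual HDFS code), counting how many racks actually hold a block:

```python
# Sketch of the layout above: 9 chosen locations, 3 per rack, with only
# the first data slot (D1) and the last three parity slots (P1-P3) used.
locations = ["N1", "N2", "N3", "N4", "N5", "N6", "N7", "N8", "N9"]
racks = {f"N{i}": f"R{(i - 1) // 3 + 1}" for i in range(1, 10)}  # N1-N3 on R1, etc.

# RS-6-3: a file smaller than one cell uses slot 0 (D1) and slots 6-8 (P1-P3).
used_slots = [0, 6, 7, 8]
used_racks = {racks[locations[s]] for s in used_slots}
print(sorted(used_racks))  # ['R1', 'R3'] -> only 2 of the 3 racks hold blocks
```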

Because the block does not have enough racks, {{ErasureCodingWork}} will later 
be created to replicate the block to a new rack. However, the current code 
tries to copy the block to the first node in the chosen locations, regardless 
of its rack, so the copy is not guaranteed to improve the situation, and we 
constantly see {{PendingReconstructionMonitor timed out}} messages in the log.
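A minimal sketch of why this retry cannot converge (hypothetical data, not the actual {{ErasureCodingWork}} logic): when the first chosen location sits on a rack that already holds the block, the copy adds no new rack, whereas a rack-aware pick would.

```python
# Racks that already hold the block, from the layout above.
racks_with_block = {"R1", "R3"}
# Chosen locations as (node, rack) pairs; names are illustrative.
chosen = [("N1", "R1"), ("N2", "R1"), ("N4", "R2")]

# Current behaviour: always target the first chosen location.
first_node, first_rack = chosen[0]
print(first_rack in racks_with_block)  # True -> the copy adds no new rack

# A rack-aware pick would scan for a node on a rack not yet used:
target = next((n for n, r in chosen if r not in racks_with_block), None)
print(target)  # N4 (on R2)
```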
h2. Proposed solution

1. Reorder the chosen locations by rack so that the parity blocks are spread 
across as many racks as possible.
2. Make {{ErasureCodingWork}} try to find a target on a new rack.
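Point 1 could look like the following round-robin-by-rack sketch (an illustration of the idea under my own naming, not the actual patch): interleaving the chosen locations across racks means any contiguous run of slots, including the parity tail, touches as many racks as possible.

```python
from collections import defaultdict, deque

def reorder_by_rack(locations, rack_of):
    """Round-robin the chosen locations across racks so that any
    contiguous run of slots spans as many racks as possible."""
    by_rack = defaultdict(deque)
    for node in locations:
        by_rack[rack_of[node]].append(node)
    queues = deque(by_rack.values())
    out = []
    while queues:
        q = queues.popleft()   # take one node from the next rack
        out.append(q.popleft())
        if q:                  # rack still has nodes -> back of the line
            queues.append(q)
    return out

locations = [f"N{i}" for i in range(1, 10)]
rack_of = {f"N{i}": f"R{(i - 1) // 3 + 1}" for i in range(1, 10)}

reordered = reorder_by_rack(locations, rack_of)
print(reordered)  # ['N1', 'N4', 'N7', 'N2', 'N5', 'N8', 'N3', 'N6', 'N9']

# The used slots (D1 plus P1-P3) now land on all 3 racks.
used = [reordered[s] for s in (0, 6, 7, 8)]
print({rack_of[n] for n in used})  # {'R1', 'R2', 'R3'}
```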
h2. Real-world test result

We first noticed the problem on our HBase cluster running Hadoop 3.3.6 on 18 
nodes across 3 racks. After setting the RS-6-3-1024K policy on the HBase data 
directory, we noticed that:

1. FSCK reports "Unsatisfactory placement block groups" for small EC files.
{noformat}
  /hbase/***:  Replica placement policy is violated for ***. Block should be 
additionally replicated on 2 more rack(s). Total number of racks in the 
cluster: 3
  ...

  Erasure Coded Block Groups:
    ...
    Unsatisfactory placement block groups: 1475 (2.5252092 %)
  {noformat}
2. The NameNode keeps logging "PendingReconstructionMonitor timed out" messages 
every recheck interval (5 minutes).
3. The {{FSNamesystem.UnderReplicatedBlocks}} metric bumps and clears every 
recheck interval.

After applying the patch, all of these problems are gone: "Unsatisfactory 
placement block groups" is now zero, and there are no more metric bumps or 
"timed out" logs.

 !screenshot-1.png|width=500! 



> Better block placement for small EC files
> -----------------------------------------
>
>                 Key: HDFS-17620
>                 URL: https://issues.apache.org/jira/browse/HDFS-17620
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: erasure-coding, namenode
>    Affects Versions: 3.3.6
>            Reporter: Junegunn Choi
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: screenshot-1.png
>
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
