[ 
https://issues.apache.org/jira/browse/HDFS-17620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17880475#comment-17880475
 ] 

ASF GitHub Bot commented on HDFS-17620:
---------------------------------------

junegunn opened a new pull request, #7035:
URL: https://github.com/apache/hadoop/pull/7035

   <!--
     Thanks for sending a pull request!
       1. If this is your first time, please read our contributor guidelines: 
https://cwiki.apache.org/confluence/display/HADOOP/How+To+Contribute
       2. Make sure your PR title starts with JIRA issue id, e.g., 
'HADOOP-17799. Your PR title ...'.
   -->
   
   ### Description of PR
   
   Fixes block placement for small EC files. A detailed description is in the 
JIRA issue.
   
   ### How was this patch tested?
   
   Tested on our production HBase cluster running Hadoop 3.3.6.
   
   ### For code changes:
   
   - [x] Does the title of this PR start with the corresponding JIRA issue id 
(e.g. 'HADOOP-17799. Your PR title ...')?
   - [ ] Object storage: have the integration tests been executed and the 
endpoint declared according to the connector-specific documentation?
   - [ ] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)?
   - [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, 
`NOTICE-binary` files?
   
   




> Better block placement for small EC files
> -----------------------------------------
>
>                 Key: HDFS-17620
>                 URL: https://issues.apache.org/jira/browse/HDFS-17620
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: erasure-coding, namenode
>    Affects Versions: 3.3.6
>            Reporter: Junegunn Choi
>            Priority: Major
>
> h2. Problem description
> If an erasure-coded file is not large enough to fill the stripe width of the 
> EC policy, the block distribution can be suboptimal.
> For example, an RS-6-3-1024k EC file smaller than 1024k has only 1 data 
> block and 3 parity blocks. Although all 9 (6 + 3) storage locations are chosen 
> by the block placement policy, only 4 of them are used, and the last 3 
> locations hold the parity blocks. If the cluster has only a few racks 
> (e.g. 3), the current scheme of choosing a pipeline with the shortest path 
> tends to put the last nodes in the same rack, resulting in a suboptimal rack 
> distribution.
> {noformat}
> Locations: N1 N2 N3 N4 N5 N6 N7 N8 N9
>     Racks: R1 R1 R1 R2 R2 R2 R3 R3 R3
>    Blocks: D1                P1 P2 P3
> {noformat}
> We can see that blocks are stored in only 2 racks, not 3.
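> For illustration only (this is not Hadoop code), here is a minimal sketch of 
> the arithmetic behind the example above: a file smaller than one cell of an 
> RS-6-3-1024k policy writes just 1 data block plus 3 parity blocks, so only 4 
> of the 9 chosen locations are actually used.
> {code:java}
> // Illustrative arithmetic only; not taken from the Hadoop code base.
> public class SmallEcFileBlocks {
>   public static void main(String[] args) {
>     final int dataUnits = 6;              // RS-6-3: 6 data units
>     final int parityUnits = 3;            // RS-6-3: 3 parity units
>     final long cellSize = 1024L * 1024L;  // 1024k cell size
>
>     long fileSize = 500L * 1024L;         // a 500 KiB file, smaller than one cell
>
>     // Data blocks actually written: one per cell column that receives data,
>     // capped at the number of data units in the policy.
>     long dataBlocks = Math.min((fileSize + cellSize - 1) / cellSize, dataUnits);
>
>     // Parity blocks are always written for the block group.
>     long usedLocations = dataBlocks + parityUnits;
>
>     System.out.println("Chosen locations: " + (dataUnits + parityUnits)); // 9
>     System.out.println("Used locations  : " + usedLocations);            // 4
>   }
> }
> {code}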
> Because the block group does not span enough racks, an {{ErasureCodingWork}} is 
> later created to replicate a block to a new rack. However, the current code 
> tries to copy the block to the first node in the chosen locations, regardless 
> of its rack, so the reconstruction is not guaranteed to improve the situation, 
> and we constantly see {{PendingReconstructionMonitor timed out}} messages in 
> the log.
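> To make the failure mode concrete, here is a rough, self-contained sketch (the 
> names are illustrative and do not reflect the actual {{ErasureCodingWork}} 
> internals): a reconstruction target chosen without checking its rack against 
> the racks that already hold blocks cannot be relied on to add a new rack.
> {code:java}
> import java.util.*;
>
> // Illustrative only; the real logic lives in ErasureCodingWork and the
> // block placement policy.
> public class ReconstructionTargetSketch {
>   public static void main(String[] args) {
>     // Racks of the 9 chosen locations from the example above.
>     String[] locationRacks = {"R1","R1","R1","R2","R2","R2","R3","R3","R3"};
>     // Indices that actually hold blocks: D1 on N1, P1-P3 on N7-N9.
>     int[] usedIndices = {0, 6, 7, 8};
>
>     Set<String> racksWithBlocks = new HashSet<>();
>     for (int i : usedIndices) {
>       racksWithBlocks.add(locationRacks[i]);
>     }
>     System.out.println("Racks with blocks: " + racksWithBlocks); // e.g. [R1, R3]
>
>     // A target picked without a rack check (here, the first chosen location,
>     // on R1) does not add a new rack, so the placement stays unsatisfactory.
>     String naiveTargetRack = locationRacks[0];
>     System.out.println("Adds a new rack: "
>         + !racksWithBlocks.contains(naiveTargetRack)); // false
>   }
> }
> {code}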
> h2. Proposed solution
> 1. Reorder the chosen locations by rack so that the parity blocks are stored 
> in as many racks as possible (a rough sketch follows below).
> 2. Make {{ErasureCodingWork}} try to find a target on a new rack.
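> Here is a minimal, self-contained sketch of item 1, assuming a simple 
> round-robin by rack; the node and rack names mirror the example above, and 
> this is only an illustration of the idea, not the actual patch.
> {code:java}
> import java.util.*;
>
> // Reorder the chosen locations round-robin by rack so that the tail of the
> // pipeline, where the parity blocks of a small file land, spans as many
> // racks as possible.
> public class RackRoundRobinSketch {
>
>   static List<String> reorderByRack(List<String> nodes, Map<String, String> rackOf) {
>     // Group nodes by rack, preserving the original order within each rack.
>     Map<String, Deque<String>> byRack = new LinkedHashMap<>();
>     for (String node : nodes) {
>       byRack.computeIfAbsent(rackOf.get(node), r -> new ArrayDeque<>()).add(node);
>     }
>     // Take one node from each rack in turn until every node is placed.
>     List<String> reordered = new ArrayList<>(nodes.size());
>     while (reordered.size() < nodes.size()) {
>       for (Deque<String> rackNodes : byRack.values()) {
>         if (!rackNodes.isEmpty()) {
>           reordered.add(rackNodes.removeFirst());
>         }
>       }
>     }
>     return reordered;
>   }
>
>   public static void main(String[] args) {
>     List<String> nodes = Arrays.asList("N1","N2","N3","N4","N5","N6","N7","N8","N9");
>     Map<String, String> rackOf = new HashMap<>();
>     for (int i = 0; i < nodes.size(); i++) {
>       rackOf.put(nodes.get(i), "R" + (i / 3 + 1)); // N1-N3 on R1, N4-N6 on R2, N7-N9 on R3
>     }
>     // Before: N1 N2 N3 N4 N5 N6 N7 N8 N9 -> the last three are all on R3.
>     // After : N1 N4 N7 N2 N5 N8 N3 N6 N9 -> the last three are on R1, R2, R3.
>     System.out.println(reorderByRack(nodes, rackOf));
>   }
> }
> {code}
> With the reordered locations, a one-cell file stores D1 on R1 and P1-P3 on R1, 
> R2, and R3, so all 3 racks hold a block and no follow-up reconstruction is 
> needed.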
> h2. Real-world test result
> We first noticed the problem on our HBase cluster running Hadoop 3.3.6 on 18 
> nodes across 3 racks. After setting the RS-6-3-1024k policy on the HBase data 
> directory, we observed the following:
> 1. FSCK reports "Unsatisfactory placement block groups" for small EC files.
> {noformat}
>   /hbase/***:  Replica placement policy is violated for ***. Block should be 
> additionally replicated on 2 more rack(s). Total number of racks in the 
> cluster: 3
>   ...
>   Erasure Coded Block Groups:
>     ...
>     Unsatisfactory placement block groups: 1475 (2.5252092 %)
>   {noformat}
> 2. The NameNode keeps logging "PendingReconstructionMonitor timed out" messages 
> every recheck-interval (5 minutes).
> 3. The {{FSNamesystem.UnderReplicatedBlocks}} metric bumps and clears every 
> recheck-interval.
> After applying the patch, all of these problems are gone: "Unsatisfactory 
> placement block groups" is now zero, and there are no metric bumps or "timed 
> out" logs.



