[
https://issues.apache.org/jira/browse/HDFS-16182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Wei-Chiu Chuang updated HDFS-16182:
-----------------------------------
Fix Version/s: 3.4.0
3.2.3
3.3.2
Resolution: Fixed
Status: Resolved (was: Patch Available)
Merged to trunk, branch-3.3 and branch-3.2. Thanks [~max2049]!
> numOfReplicas is given the wrong value in
> BlockPlacementPolicyDefault$chooseTarget, which can cause DataStreamer to fail
> with Heterogeneous Storage
> -----------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: HDFS-16182
> URL: https://issues.apache.org/jira/browse/HDFS-16182
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 3.4.0
> Reporter: Max Xie
> Assignee: Max Xie
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.4.0, 3.2.3, 3.3.2
>
> Attachments: HDFS-16182.patch
>
> Time Spent: 3.5h
> Remaining Estimate: 0h
>
> In our HDFS cluster, we use heterogeneous storage and keep data on SSD for
> better performance. Sometimes, while the HDFS client is transferring data in a
> pipeline, it throws an IOException and exits. The exception log is below:
> ```
> java.io.IOException: Failed to replace a bad datanode on the existing pipeline
> due to no more good datanodes being available to try. (Nodes:
> current=[DatanodeInfoWithStorage[dn01_ip:5004,DS-ef7882e0-427d-4c1e-b9ba-a929fac44fb4,DISK],
> DatanodeInfoWithStorage[dn02_ip:5004,DS-3871282a-ad45-4332-866a-f000f9361ecb,DISK],
> DatanodeInfoWithStorage[dn03_ip:5004,DS-a388c067-76a4-4014-a16c-ccc49c8da77b,SSD],
> DatanodeInfoWithStorage[dn04_ip:5004,DS-b81da262-0dd9-4567-a498-c516fab84fe0,SSD],
> DatanodeInfoWithStorage[dn05_ip:5004,DS-34e3af2e-da80-46ac-938c-6a3218a646b9,SSD]],
> original=[DatanodeInfoWithStorage[dn01_ip:5004,DS-ef7882e0-427d-4c1e-b9ba-a929fac44fb4,DISK],
> DatanodeInfoWithStorage[dn02_ip:5004,DS-3871282a-ad45-4332-866a-f000f9361ecb,DISK]]).
> The current failed datanode replacement policy is DEFAULT, and a client may
> configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy'
> in its configuration.
> ```
> After digging into it, I found that when an existing pipeline needs a new
> datanode to continue transferring data, the client asks the namenode for one
> additional datanode and then checks that the resulting number of datanodes is
> the original number + 1.
> ```
> // DataStreamer$findNewDatanode
> if (nodes.length != original.length + 1) {
>   throw new IOException(
>       "Failed to replace a bad datanode on the existing pipeline "
>           + "due to no more good datanodes being available to try. "
>           + "(Nodes: current=" + Arrays.asList(nodes)
>           + ", original=" + Arrays.asList(original) + "). "
>           + "The current failed datanode replacement policy is "
>           + dfsClient.dtpReplaceDatanodeOnFailure
>           + ", and a client may configure this via '"
>           + BlockWrite.ReplaceDatanodeOnFailure.POLICY_KEY
>           + "' in its configuration.");
> }
> ```
> The root cause is that Namenode$getAdditionalDatanode returns multiple
> datanodes, not just one, to DataStreamer.addDatanode2ExistingPipeline.
>
> Maybe we can fix it in BlockPlacementPolicyDefault$chooseTarget. I think
> numOfReplicas should not be reassigned from requiredStorageTypes.
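>
> As a rough, self-contained sketch (not the actual Hadoop source; the class,
> method, and variable names below are simplified stand-ins modelled on the
> description above): if chooseTarget overwrites the requested count with the
> number of still-required storage types, a request for one replacement node can
> come back with two, and the DataStreamer check quoted above then fails.
> ```
> import java.util.Arrays;
> import java.util.List;
>
> public class ChooseTargetSketch {
>
>   // Simplified stand-in for BlockPlacementPolicyDefault$chooseTarget: the
>   // caller asks for numOfReplicas additional nodes, but the count is
>   // overwritten with the number of still-required storage types.
>   static List<String> chooseTarget(int numOfReplicas,
>                                    List<String> candidatesPerRequiredType) {
>     numOfReplicas = candidatesPerRequiredType.size(); // suspected bug: 1 -> 2
>     return candidatesPerRequiredType.subList(0, numOfReplicas);
>   }
>
>   public static void main(String[] args) {
>     // Surviving pipeline after dropping the bad datanode.
>     List<String> original = Arrays.asList("dn01:DISK", "dn02:DISK");
>
>     // The client wants exactly one replacement, but with heterogeneous
>     // storage the policy still sees one missing DISK replica and one missing
>     // SSD replica, so it hands back two targets.
>     List<String> targets =
>         chooseTarget(1, Arrays.asList("dn06:DISK", "dn07:SSD"));
>
>     int nodes = original.size() + targets.size();
>     // Same shape as the DataStreamer$findNewDatanode check quoted above.
>     if (nodes != original.size() + 1) {
>       System.out.println("Failed to replace a bad datanode: expected 1 new "
>           + "node, got " + targets.size() + ": " + targets);
>     }
>   }
> }
> ```
> Keeping numOfReplicas at the value the client passed in (one, in the
> replacement case) would make chooseTarget return a single target and preserve
> the original.length + 1 invariant.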
>
--
This message was sent by Atlassian Jira
(v8.20.1#820001)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]