[
https://issues.apache.org/jira/browse/HDFS-16987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18032303#comment-18032303
]
ASF GitHub Bot commented on HDFS-16987:
---------------------------------------
github-actions[bot] closed pull request #5583: HDFS-16987. [BugFix]
MarkBlockAsCorrupt should not mark a replica as corrupted if the DN has a
newest replica
URL: https://github.com/apache/hadoop/pull/5583
> NameNode should remove all invalid corrupted blocks when starting active
> service
> --------------------------------------------------------------------------------
>
> Key: HDFS-16987
> URL: https://issues.apache.org/jira/browse/HDFS-16987
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: ZanderXu
> Assignee: ZanderXu
> Priority: Critical
> Labels: pull-request-available
>
> In our production environment, an HA failover produced some new corrupted
> blocks, which caused some jobs to fail.
>
> We traced this down to a bug in how all pending DN messages are processed
> when starting active services.
> The steps to reproduce are as follows:
> # Suppose NN1 is Active and NN2 is Standby. The Active works well and the
> Standby is unstable.
> # Timing 1: the client creates a file, writes some data, and closes it.
> # Timing 2: the client appends to this file, writes some data, and closes it.
> # Timing 3: the Standby replays the edits for the second close of this file.
> # Timing 4: the Standby processes the blockReceivedAndDeleted of the first
> create operation.
> # Timing 5: the Standby processes the blockReceivedAndDeleted of the second
> append operation.
> # Timing 6: the admin switches the active NameNode from NN1 to NN2.
> # Timing 7: the client fails to append data to this file.
> {code:java}
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): append: lastBlock=blk_1073741825_1002 of src=/testCorruptedBlockAfterHAFailover is not sufficiently replicated yet.
>     at org.apache.hadoop.hdfs.server.namenode.FSDirAppendOp.appendFile(FSDirAppendOp.java:138)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2992)
>     at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:858)
>     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:527)
>     at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:621)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:589)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1227)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1221)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1144)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1953)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3170)
> {code}
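
The ordering problem in the reproduction steps can be illustrated with a toy model. The sketch below is not Hadoop code and does not reflect the actual BlockManager logic; the class, method, and datanode names are hypothetical. The generation stamp 1002 is taken from blk_1073741825_1002 in the error above, and 1001 for the original create is an assumption. It contrasts a naive rule (any queued report older than the stored generation stamp marks the replica corrupt) with a guarded rule in the spirit of the pull request title (do not mark a replica corrupt if the same DN has also reported a newer replica).

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy model only; all names here are hypothetical, not Hadoop APIs. */
class StaleReportSketch {

    /** One queued datanode message: which DN reported which generation stamp. */
    static final class PendingReport {
        final String datanode;
        final long genStamp;

        PendingReport(String datanode, long genStamp) {
            this.datanode = datanode;
            this.genStamp = genStamp;
        }
    }

    /** Naive rule: every report older than the stored stamp marks its DN corrupt. */
    static Set<String> naiveCorrupt(long storedGenStamp, List<PendingReport> queue) {
        Set<String> corrupt = new TreeSet<>();
        for (PendingReport r : queue) {
            if (r.genStamp < storedGenStamp) {
                corrupt.add(r.datanode);
            }
        }
        return corrupt;
    }

    /** Guarded rule: a DN is corrupt only if its newest reported stamp is still stale. */
    static Set<String> guardedCorrupt(long storedGenStamp, List<PendingReport> queue) {
        Map<String, Long> newest = new HashMap<>();
        for (PendingReport r : queue) {
            newest.merge(r.datanode, r.genStamp, Long::max);
        }
        Set<String> corrupt = new TreeSet<>();
        for (Map.Entry<String, Long> e : newest.entrySet()) {
            if (e.getValue() < storedGenStamp) {
                corrupt.add(e.getKey());
            }
        }
        return corrupt;
    }

    /** Queue matching Timing 4 and 5: the create report (assumed 1001), then the append report (1002). */
    static List<PendingReport> exampleQueue() {
        return List.of(new PendingReport("dn1", 1001L), new PendingReport("dn1", 1002L));
    }

    public static void main(String[] args) {
        long stored = 1002L; // the Standby has already replayed the second close (Timing 3)
        System.out.println("naive:   " + naiveCorrupt(stored, exampleQueue()));
        System.out.println("guarded: " + guardedCorrupt(stored, exampleQueue()));
    }
}
```

In this model the naive rule flags dn1 because of the older create report, even though the same DN later reports the current generation stamp, while the guarded rule leaves it alone. A DN that only ever reported the older stamp would be flagged by both rules.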
--
This message was sent by Atlassian Jira
(v8.20.10#820010)