ZanderXu opened a new pull request, #5583: URL: https://github.com/apache/hadoop/pull/5583
In our prod environment, we encountered an incident where an HA failover produced new corrupted blocks, causing some jobs to fail. We traced it down to a bug in the processing of all pending DN messages when starting active services.

Suppose NN1 is Active and NN2 is Standby; the Active works well and the Standby is unstable. The steps to reproduce are as follows:

- Timing 1: the client creates a file, writes some data, and closes it.
- Timing 2: the client appends to this file, writes some data, and closes it.
- Timing 3: the Standby replays the edits for the second close of this file.
- Timing 4: the Standby processes the blockReceivedAndDeleted for the first operation (the create).
- Timing 5: the Standby processes the blockReceivedAndDeleted for the second operation (the append).
- Timing 6: the admin switches the active namenode from NN1 to NN2.
- Timing 7: the client fails to append more data to this file.

```
org.apache.hadoop.ipc.RemoteException(java.io.IOException): append: lastBlock=blk_1073741825_1002 of src=/testCorruptedBlockAfterHAFailover is not sufficiently replicated yet.
    at org.apache.hadoop.hdfs.server.namenode.FSDirAppendOp.appendFile(FSDirAppendOp.java:138)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2992)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:858)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:527)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:621)
    at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:589)
    at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1227)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1221)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1144)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1953)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3170)
```
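
To make the ordering in Timings 3 to 7 easier to follow, here is a toy model in plain Java. It is not HDFS code and does not claim to mirror the real NameNode logic. It assumes that the Standby queues a report that looks stale instead of acting on it, and that the queue is applied as-is when active services start. The class `PendingDnMessageSketch`, its fields and methods, and the generation stamp 1001 for the create are hypothetical; only 1002 comes from the block name in the exception above.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Toy model of one block with one replica; none of this is an HDFS class. */
class PendingDnMessageSketch {
    long storedGenStamp;      // generation stamp learned from replayed edits
    boolean replicaLive;      // a report matching storedGenStamp was accepted
    boolean replicaCorrupt;   // the replica was marked corrupt
    final Queue<Long> pending = new ArrayDeque<>(); // reports postponed on the Standby

    /** Standby replays an edit that sets the block's generation stamp. */
    void replayEdit(long genStamp) {
        storedGenStamp = genStamp;
    }

    /** Standby receives a DataNode report (assumed behaviour: queue stale ones). */
    void reportOnStandby(long reportedGenStamp) {
        if (reportedGenStamp < storedGenStamp) {
            pending.add(reportedGenStamp);   // looks stale, postpone the decision
        } else if (reportedGenStamp == storedGenStamp) {
            replicaLive = true;              // matches the edits, accept it
        }
    }

    /** Failover: apply every queued report without checking newer accepted ones. */
    void startActiveServices() {
        Long genStamp;
        while ((genStamp = pending.poll()) != null) {
            if (genStamp < storedGenStamp) {
                replicaCorrupt = true;       // stale report wins over the newer one
            }
        }
    }

    boolean canAppend() {
        return replicaLive && !replicaCorrupt;
    }

    /** Replays Timings 3 to 7 and reports whether the append would be allowed. */
    static boolean appendAllowedAfterFailover() {
        PendingDnMessageSketch nn2 = new PendingDnMessageSketch();
        nn2.replayEdit(1002L);        // Timing 3: edits of the second close
        nn2.reportOnStandby(1001L);   // Timing 4: report from the create (assumed 1001)
        nn2.reportOnStandby(1002L);   // Timing 5: report from the append
        nn2.startActiveServices();    // Timing 6: NN2 becomes Active
        return nn2.canAppend();       // Timing 7: client tries to append
    }

    public static void main(String[] args) {
        System.out.println("append allowed after failover: " + appendAllowedAfterFailover());
    }
}
```

In this model the only replica ends up marked corrupt even though the DataNode already reported the newer generation stamp, so `appendAllowedAfterFailover()` returns `false`. That is one plausible reading of the "not sufficiently replicated yet" error, stated here as an assumption.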
