[ 
https://issues.apache.org/jira/browse/HDFS-16987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17715395#comment-17715395
 ] 

ASF GitHub Bot commented on HDFS-16987:
---------------------------------------

ZanderXu opened a new pull request, #5583:
URL: https://github.com/apache/hadoop/pull/5583

   In our prod environment, we encountered an incident where an HA failover 
introduced some new corrupted blocks, causing some jobs to fail.
   
   We traced it down and found a bug in the processing of all pending 
DataNode messages when starting active services.
   
   Suppose NN1 is Active and NN2 is Standby; the Active works well while the 
Standby is unstable.
   
   The steps to reproduce are as follows:
   
   - Timing 1, the client creates a file, writes some data, and closes it.
   
   - Timing 2, the client appends to this file, writes some data, and closes it.
   
   - Timing 3, the Standby replays the second close edit of this file.
   
   - Timing 4, the Standby processes the blockReceivedAndDeleted report of 
the first create operation.
   
   - Timing 5, the Standby processes the blockReceivedAndDeleted report of 
the second append operation.
   
   - Timing 6, the admin switches the active NameNode from NN1 to NN2.
   
   - Timing 7, the client fails to append data to this file:
   
   ```
   org.apache.hadoop.ipc.RemoteException(java.io.IOException): append: lastBlock=blk_1073741825_1002 of src=/testCorruptedBlockAfterHAFailover is not sufficiently replicated yet.
       at org.apache.hadoop.hdfs.server.namenode.FSDirAppendOp.appendFile(FSDirAppendOp.java:138)
       at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2992)
       at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:858)
       at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:527)
       at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
       at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:621)
       at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:589)
       at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573)
       at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1227)
       at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1221)
       at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1144)
       at java.security.AccessController.doPrivileged(Native Method)
       at javax.security.auth.Subject.doAs(Subject.java:422)
       at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1953)
       at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3170)
   ```
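   The interaction of Timings 3-6 can be illustrated with a small, self-contained model. This is a hedged sketch, not the real BlockManager code: the method name `appendAllowed`, the single datanode `dn1`, and the queue handling are simplified assumptions. The idea is that when the Standby replays its queued blockReceivedAndDeleted messages on transition to active, the report from Timing 4 still carries the pre-append generation stamp (1001), which is older than the stored one (1002), so the replica is marked corrupt, and that stale marking is never cleaned up, producing the "not sufficiently replicated" failure above:

   ```java
   import java.util.ArrayDeque;
   import java.util.HashSet;
   import java.util.Queue;
   import java.util.Set;

   // Hypothetical, simplified model of replaying the Standby's pending DN
   // message queue at failover. Not the actual HDFS implementation.
   public class StaleReportDemo {

       /**
        * Replays queued generation stamps for one block against the gen stamp
        * stored in the namespace; returns whether append could proceed after.
        */
       static boolean appendAllowed(long storedGenStamp, long[] queuedGenStamps) {
           Queue<Long> pending = new ArrayDeque<>();
           for (long gs : queuedGenStamps) {
               pending.add(gs);
           }

           Set<String> corrupt = new HashSet<>();
           Set<String> live = new HashSet<>();
           while (!pending.isEmpty()) {
               long gs = pending.poll();
               if (gs < storedGenStamp) {
                   corrupt.add("dn1"); // stale gen stamp: replica looks corrupt
               } else {
                   live.add("dn1");    // matching gen stamp: healthy replica
               }
           }
           // The corrupt marking made by the stale report is never removed, so
           // the last block is not "sufficiently replicated" and append fails.
           return corrupt.isEmpty() && !live.isEmpty();
       }

       public static void main(String[] args) {
           // Timings 4+5: create-op report (gs 1001), then append-op report
           // (gs 1002), replayed at failover against stored gen stamp 1002.
           System.out.println("append allowed: "
               + appendAllowed(1002, new long[]{1001, 1002}));
       }
   }
   ```

   In this model, dropping (or cleaning up) reports whose generation stamp is older than the stored one would let the later, valid report win, which is the behavior the title of this issue asks for when starting active services.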
   




> NameNode should remove all invalid corrupted blocks when starting active 
> service
> --------------------------------------------------------------------------------
>
>                 Key: HDFS-16987
>                 URL: https://issues.apache.org/jira/browse/HDFS-16987
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: ZanderXu
>            Assignee: ZanderXu
>            Priority: Critical
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
