haiyang1987 opened a new pull request, #6635:
URL: https://github.com/apache/hadoop/pull/6635

   ### Description of PR
   https://issues.apache.org/jira/browse/HDFS-17430
   
   RecoveringBlock should skip replicas with no live datanode when building the block recovery command.
   
   
   **Issue:**
   Currently, the following scenario can cause the datanode's BlockRecoveryWorker to fail repeatedly, leaving the file unclosed for a long time.
   
   **t1.** Block blk_xxx_xxx has two replicas [dn1, dn2]; the dn1 machine shuts down and the node becomes dead, while dn2 stays live.
   
   **t2.** Block recovery is triggered for the block.
   Related logs:
   ```
   2024-03-13 21:58:00.651 WARN hdfs.StateChange        DIR* NameSystem.internalReleaseLease: File /xxx/file has not been closed. Lease recovery is in progress. RecoveryId = 28577373754 for block blk_xxx_xxx
   ```
   
   **t3.** dn2 is chosen as the primary datanode for block recovery.
   At this point dn1 is marked as stale (it is in fact dead), so recoveryLocations has size 1. According to the logic below, the fallback branch then selects both dn1 and dn2 to participate in the block recovery.
   
   DatanodeManager#getBlockRecoveryCommand
   ```
    // Skip stale nodes during recovery
    final List<DatanodeStorageInfo> recoveryLocations =
        new ArrayList<>(storages.length);
    final List<Integer> storageIdx = new ArrayList<>(storages.length);
    for (int i = 0; i < storages.length; ++i) {
      if (!storages[i].getDatanodeDescriptor().isStale(staleInterval)) {
        recoveryLocations.add(storages[i]);
        storageIdx.add(i);
      }
    }
    ...
    // If we only get 1 replica after eliminating stale nodes, choose all
    // replicas for recovery and let the primary data node handle failures.
    DatanodeInfo[] recoveryInfos;
    if (recoveryLocations.size() > 1) {
      if (recoveryLocations.size() != storages.length) {
        LOG.info("Skipped stale nodes for recovery : "
            + (storages.length - recoveryLocations.size()));
      }
      recoveryInfos = DatanodeStorageInfo.toDatanodeInfos(recoveryLocations);
    } else {
      // If too many replicas are stale, then choose all replicas to
      // participate in block recovery.
      recoveryInfos = DatanodeStorageInfo.toDatanodeInfos(storages);
    }
   ```
   ```
   2024-03-13 21:58:01,425 INFO  datanode.DataNode (BlockRecoveryWorker.java:logRecoverBlock(563)) [org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1@54e291ac] -
   BlockRecoveryWorker: NameNode at xxx:8040 calls recoverBlock(BP-xxx:blk_xxx_xxx, targets=[DatanodeInfoWithStorage[dn1:50010,null,null], DatanodeInfoWithStorage[dn2:50010,null,null]], newGenerationStamp=28577373754, newBlock=null, isStriped=false)
   ```
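
   To make the effect of this branch concrete, here is a minimal, self-contained sketch that reproduces the t1/t3 situation, using plain strings and booleans as stand-ins for DatanodeStorageInfo and the staleness check (illustrative only, not Hadoop code): with dn1 stale and dn2 live, the fallback branch returns both replicas, including the dead dn1, as recovery targets.
   ```
   import java.util.ArrayList;
   import java.util.Arrays;
   import java.util.List;

   // Stand-alone illustration of the selection branch quoted above.
   public class RecoveryTargetSelectionSketch {
     public static void main(String[] args) {
       String[] storages = {"dn1", "dn2"};  // t1: two replicas
       boolean[] stale = {true, false};     // dn1 is stale (machine is down)

       // Same shape as the loop in getBlockRecoveryCommand: keep non-stale nodes.
       List<String> recoveryLocations = new ArrayList<>();
       for (int i = 0; i < storages.length; i++) {
         if (!stale[i]) {
           recoveryLocations.add(storages[i]);
         }
       }

       List<String> recoveryInfos;
       if (recoveryLocations.size() > 1) {
         recoveryInfos = recoveryLocations;
       } else {
         // Only one non-stale replica is left, so the fallback hands ALL
         // replicas, including the dead dn1, to the primary datanode.
         recoveryInfos = Arrays.asList(storages);
       }
       System.out.println("recovery targets = " + recoveryInfos); // [dn1, dn2]
     }
   }
   ```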
   
   **t4.** When dn2 executes BlockRecoveryWorker#recover, it calls initReplicaRecovery on dn1. However, because the dn1 machine is down, that call takes a very long time to time out: by default the IPC client retries the connection 45 times.
   Related logs:
   
   ```
   2024-03-13 21:59:31,518 INFO  ipc.Client (Client.java:handleConnectionTimeout(904)) [org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1@54e291ac] - Retrying connect to server: dn1:8010. Already tried 0 time(s); maxRetries=45
   ...
   2024-03-13 23:05:35,295 INFO  ipc.Client (Client.java:handleConnectionTimeout(904)) [org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1@54e291ac] - Retrying connect to server: dn2:8010. Already tried 44 time(s); maxRetries=45

   2024-03-13 23:07:05,392 WARN  protocol.InterDatanodeProtocol (BlockRecoveryWorker.java:recover(170)) [org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1@54e291ac] -
   Failed to recover block (block=BP-xxx:blk_xxx_xxx, datanode=DatanodeInfoWithStorage[dn1:50010,null,null]) org.apache.hadoop.net.ConnectTimeoutException:
   Call From dn2 to dn1:8010 failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 90000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=dn:8010]; For more details see:  http://wiki.apache.org/hadoop/SocketTimeout
           at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
           at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
           at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
           at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
           at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:931)
           at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:866)
           at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1583)
           at org.apache.hadoop.ipc.Client.call(Client.java:1511)
           at org.apache.hadoop.ipc.Client.call(Client.java:1402)
           at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:268)
           at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:142)
           at com.sun.proxy.$Proxy23.initReplicaRecovery(Unknown Source)
           at org.apache.hadoop.hdfs.protocolPB.InterDatanodeProtocolTranslatorPB.initReplicaRecovery(InterDatanodeProtocolTranslatorPB.java:83)
           at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker.callInitReplicaRecovery(BlockRecoveryWorker.java:579)
           at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker.access$400(BlockRecoveryWorker.java:57)
           at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$RecoveryTaskContiguous.recover(BlockRecoveryWorker.java:135)
           at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1.run(BlockRecoveryWorker.java:620)
           at java.lang.Thread.run(Thread.java:748)
   Caused by: org.apache.hadoop.net.ConnectTimeoutException: 90000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=dn1:8010]
           at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:607)
           at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:662)
           at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:783)
           at org.apache.hadoop.ipc.Client$Connection.access$3900(Client.java:346)
           at org.apache.hadoop.ipc.Client.getConnection(Client.java:1653)
           at org.apache.hadoop.ipc.Client.call(Client.java:1449)
           ... 10 more
   ```
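
   For clarity, the following self-contained sketch (hypothetical names, not the real BlockRecoveryWorker code) shows why a single dead target stalls the whole task: the targets from the command are contacted one at a time, and with the 90000 ms connect timeout and maxRetries=45 seen above, the dead dn1 alone accounts for roughly 67 minutes before the worker moves on, which matches the gap between 21:59 and 23:07 in the logs.
   ```
   import java.util.ArrayList;
   import java.util.List;

   // Illustrative sketch only: sequential per-target recovery calls.
   public class SequentialRecoverySketch {
     static final long CONNECT_TIMEOUT_MS = 90_000L; // from the log above
     static final int MAX_RETRIES = 45;              // from the log above

     // Stand-in for the InterDatanodeProtocol#initReplicaRecovery RPC.
     static String initReplicaRecovery(String dn, boolean alive) throws Exception {
       if (!alive) {
         // The real IPC client retries the connection MAX_RETRIES times,
         // spending about MAX_RETRIES * CONNECT_TIMEOUT_MS before giving up.
         throw new Exception("ConnectTimeoutException after roughly "
             + (MAX_RETRIES * CONNECT_TIMEOUT_MS / 60000) + " minutes");
       }
       return "replica info from " + dn;
     }

     public static void main(String[] args) {
       String[] targets = {"dn1", "dn2"}; // targets from the t3 command
       boolean[] alive = {false, true};   // dn1's machine is down
       List<String> syncList = new ArrayList<>();
       for (int i = 0; i < targets.length; i++) {
         try {
           syncList.add(initReplicaRecovery(targets[i], alive[i]));
         } catch (Exception e) {
           // dn1 ends up here, but only after the long retry window.
           System.out.println("Failed to recover block on " + targets[i]
               + ": " + e.getMessage());
         }
       }
       System.out.println("syncList = " + syncList);
     }
   }
   ```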
   
   **t5.** Within that timeout window, the client or the NameNode triggers lease recovery again, so new recovery IDs are issued for the same block.
   Related logs:
   ```
   2024-03-13 22:13:01.158 WARN hdfs.StateChange        DIR* NameSystem.internalReleaseLease: File /xxx/file has not been closed. Lease recovery is in progress. RecoveryId = 28577807097 for block blk_xxx_xxx

   2024-03-13 22:58:02.701 WARN hdfs.StateChange        DIR* NameSystem.internalReleaseLease: File /xxx/file has not been closed. Lease recovery is in progress. RecoveryId = 28578772548 for block blk_xxx_xxx
   ```
   
   **t6.** The original worker thread [org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1@54e291ac] still finishes replica recovery on dn2, but its final call to commitBlockSynchronization fails because the recovery ID it carries is smaller than the current recovery ID on the NameNode.
   ```
   2024-03-13 23:07:05,401 WARN  datanode.DataNode (BlockRecoveryWorker.java:run(623)) [org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1@54e291ac] -
   recover Block: RecoveringBlock{BP-xxx:blk_xxx_xxx; getBlockSize()=0; corrupt=false; offset=-1; locs=[DatanodeInfoWithStorage[dn1:50010,null,null], DatanodeInfoWithStorage[dn2:50010,null,null]]; cachedLocs=[]}
   FAILED: {} org.apache.hadoop.ipc.RemoteException(java.io.IOException): The recovery id 28577373754 does not match current recovery id 28578772548 for block BP-xxx:blk_xxx_xxx
           at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.commitBlockSynchronization(FSNamesystem.java:4129)
           at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.commitBlockSynchronization(NameNodeRpcServer.java:1184)
           at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolServerSideTranslatorPB.commitBlockSynchronization(DatanodeProtocolServerSideTranslatorPB.java:310)
           at org.apache.hadoop.hdfs.protocol.proto.DatanodeProtocolProtos$DatanodeProtocolService$2.callBlockingMethod(DatanodeProtocolProtos.java:34391)
           at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:635)
           at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:603)
           at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:587)
           at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1137)
           at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1236)
           at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1134)
           at java.security.AccessController.doPrivileged(Native Method)
           at javax.security.auth.Subject.doAs(Subject.java:422)
           at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:2005)
           at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3360)

           at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1579)
           at org.apache.hadoop.ipc.Client.call(Client.java:1511)
           at org.apache.hadoop.ipc.Client.call(Client.java:1402)
           at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:268)
           at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:142)
           at com.sun.proxy.$Proxy17.commitBlockSynchronization(Unknown Source)
           at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.commitBlockSynchronization(DatanodeProtocolClientSideTranslatorPB.java:342)
           at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$RecoveryTaskContiguous.syncBlock(BlockRecoveryWorker.java:334)
           at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$RecoveryTaskContiguous.recover(BlockRecoveryWorker.java:189)
           at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1.run(BlockRecoveryWorker.java:620)
           at java.lang.Thread.run(Thread.java:748)
   ```
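
   The rejection comes from the recovery-id comparison on the NameNode side. Below is a simplified sketch of that check (the real logic lives in FSNamesystem#commitBlockSynchronization; the two ids are the ones from the logs in this description):
   ```
   import java.io.IOException;

   // Simplified stand-alone sketch of the namenode-side recovery-id check.
   public class RecoveryIdCheckSketch {
     public static void main(String[] args) throws IOException {
       long currentRecoveryId  = 28578772548L; // newest id, issued at t5
       long reportedRecoveryId = 28577373754L; // stale id from the t3 command
       if (reportedRecoveryId != currentRecoveryId) {
         throw new IOException("The recovery id " + reportedRecoveryId
             + " does not match current recovery id " + currentRecoveryId
             + " for block BP-xxx:blk_xxx_xxx");
       }
     }
   }
   ```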
   
   **t7.** Because of this, the client or NameNode keeps issuing new recovery commands, and every subsequent block recovery fails the same way.
   
   **t8.** Recovery only succeeds once the failed dn1 machine comes back.
   
   So we should fix this by skipping replicas with no live datanode when building the BlockRecoveryCommand, to avoid this recovery-failure loop.
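
   One possible shape of the change, sketched against the DatanodeManager#getBlockRecoveryCommand snippet above (an illustrative sketch rather than the exact patch, and it assumes DatanodeDescriptor exposes an isAlive() check): when falling back to "all replicas", still drop storages whose datanode is no longer alive, so a dead node such as dn1 is never handed to the primary datanode as a recovery target.
   ```
    // Sketch only, not the exact patch: filter out replicas whose datanode
    // is no longer alive before building the recovery target list.
    // Assumes DatanodeDescriptor#isAlive() is available here.
    List<DatanodeStorageInfo> liveStorages = new ArrayList<>(storages.length);
    for (DatanodeStorageInfo s : storages) {
      if (s.getDatanodeDescriptor().isAlive()) {
        liveStorages.add(s);
      }
    }
    recoveryInfos = DatanodeStorageInfo.toDatanodeInfos(
        liveStorages.isEmpty() ? Arrays.asList(storages) : liveStorages);
   ```
   With such a filter in place, the BlockRecoveryCommand never carries a replica on a dead datanode, so the primary datanode does not spend the whole IPC retry window on an unreachable node.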
   

