[ https://issues.apache.org/jira/browse/HDFS-17430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828054#comment-17828054 ]

ASF GitHub Bot commented on HDFS-17430:
---------------------------------------

hadoop-yetus commented on PR #6635:
URL: https://github.com/apache/hadoop/pull/6635#issuecomment-2004489045

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |   0m 23s |  |  Docker mode activated.  |
   |||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain any @author tags.  |
   | -1 :x: |  test4tests  |   0m  0s |  |  The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.  |
   |||| _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  36m 24s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   0m 48s |  |  trunk passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  compile  |   0m 48s |  |  trunk passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  checkstyle  |   0m 37s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   0m 47s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   0m 42s |  |  trunk passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  javadoc  |   1m  9s |  |  trunk passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   2m  4s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  23m  8s |  |  branch has no errors when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 38s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 40s |  |  the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  javac  |   0m 40s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 36s |  |  the patch passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  javac  |   0m 36s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks issues.  |
   | +1 :green_heart: |  checkstyle  |   0m 29s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   0m 41s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 34s |  |  the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  javadoc  |   1m  1s |  |  the patch passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   1m 44s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  24m 23s |  |  patch has no errors when building and testing our client artifacts.  |
   |||| _ Other Tests _ |
   | -1 :x: |  unit  | 213m 20s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6635/2/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 27s |  |  The patch does not generate ASF License warnings.  |
   |  |   | 312m 40s |  |  |
   
   
   | Reason | Tests |
   |-------:|:------|
   | Failed junit tests | hadoop.hdfs.tools.TestDFSAdmin |
   |   | hadoop.metrics2.sink.TestRollingFileSystemSinkWithHdfs |
   |   | hadoop.hdfs.TestDFSStripedOutputStreamWithFailureWithRandomECPolicy |
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.44 ServerAPI=1.44 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6635/2/artifact/out/Dockerfile |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6635 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux f3a291500e15 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 6376ed89e382dae4276aeeff3f7dba2def8ead7d |
   | Default Java | Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
   | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
   | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6635/2/testReport/ |
   | Max. process+thread count | 4287 (vs. ulimit of 5500) |
   | modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
   | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6635/2/console |
   | versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
   | Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   




> RecoveringBlock will skip no live replicas when get block recovery command.
> ---------------------------------------------------------------------------
>
>                 Key: HDFS-17430
>                 URL: https://issues.apache.org/jira/browse/HDFS-17430
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Haiyang Hu
>            Assignee: Haiyang Hu
>            Priority: Major
>              Labels: pull-request-available
>
> RecoveringBlock may fail to skip replicas that have no live datanode when the block recovery command is built.
> *Issue:*
> Currently, the following scenario can cause the datanode's BlockRecoveryWorker to fail, leaving the file unclosed for a long time.
> *t1.* The block blk_xxx_xxx has two replicas [dn1, dn2]; the dn1 machine has shut down and is in dead state, while dn2 is live.
> *t2.* Block recovery is triggered.
> Related logs:
> {code:java}
> 2024-03-13 21:58:00.651 WARN hdfs.StateChange DIR* 
> NameSystem.internalReleaseLease: File /xxx/file has not been closed. Lease 
> recovery is in progress. RecoveryId = 28577373754 for block blk_xxx_xxx
> {code}
> *t3.* dn2 is chosen as the primary datanode for block recovery.
> At this point dn1 is marked as stale (it is in fact dead), so recoveryLocations contains only dn2. Because its size is 1, the fallback branch in the logic below is taken and both dn1 and dn2 are chosen to participate in block recovery.
> DatanodeManager#getBlockRecoveryCommand:
> {code:java}
>    // Skip stale nodes during recovery
>      final List<DatanodeStorageInfo> recoveryLocations =
>          new ArrayList<>(storages.length);
>      final List<Integer> storageIdx = new ArrayList<>(storages.length);
>      for (int i = 0; i < storages.length; ++i) {
>        if (!storages[i].getDatanodeDescriptor().isStale(staleInterval)) {
>          recoveryLocations.add(storages[i]);
>          storageIdx.add(i);
>        }
>      }
>      ...
>      // If we only get 1 replica after eliminating stale nodes, choose all
>      // replicas for recovery and let the primary data node handle failures.
>      DatanodeInfo[] recoveryInfos;
>      if (recoveryLocations.size() > 1) {
>        if (recoveryLocations.size() != storages.length) {
>          LOG.info("Skipped stale nodes for recovery : "
>              + (storages.length - recoveryLocations.size()));
>        }
>        recoveryInfos = DatanodeStorageInfo.toDatanodeInfos(recoveryLocations);
>      } else {
>        // If too many replicas are stale, then choose all replicas to
>        // participate in block recovery.
>        recoveryInfos = DatanodeStorageInfo.toDatanodeInfos(storages);
>      }
> {code}
> {code:java}
> 2024-03-13 21:58:01,425 INFO  datanode.DataNode 
> (BlockRecoveryWorker.java:logRecoverBlock(563))
> [org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1@54e291ac] -
> BlockRecoveryWorker: NameNode at xxx:8040 calls 
> recoverBlock(BP-xxx:blk_xxx_xxx, 
> targets=[DatanodeInfoWithStorage[dn1:50010,null,null], 
> DatanodeInfoWithStorage[dn2:50010,null,null]], 
> newGenerationStamp=28577373754, newBlock=null, isStriped=false)
> {code}
> *t4.* When dn2 executes BlockRecoveryWorker#recover, it calls initReplicaRecovery on dn1. However, since the dn1 machine is down, every connection attempt times out, and by default the client retries establishing the connection 45 times, so this step takes a very long time (a rough cost estimate follows the log excerpt below).
> Related logs:
> {code:java}
> 2024-03-13 21:59:31,518 INFO  ipc.Client 
> (Client.java:handleConnectionTimeout(904)) 
> [org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1@54e291ac] - 
> Retrying connect to server: dn1:8010. Already tried 0 time(s); maxRetries=45
> ...
> 2024-03-13 23:05:35,295 INFO  ipc.Client 
> (Client.java:handleConnectionTimeout(904)) 
> [org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1@54e291ac] - 
> Retrying connect to server: dn2:8010. Already tried 44 time(s); maxRetries=45
> 2024-03-13 23:07:05,392 WARN  protocol.InterDatanodeProtocol 
> (BlockRecoveryWorker.java:recover(170)) 
> [org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1@54e291ac] -
> Failed to recover block (block=BP-xxx:blk_xxx_xxx, 
> datanode=DatanodeInfoWithStorage[dn1:50010,null,null]) 
> org.apache.hadoop.net.ConnectTimeoutException:
> Call From dn2 to dn1:8010 failed on socket timeout exception: 
> org.apache.hadoop.net.ConnectTimeoutException: 90000 millis timeout while 
> waiting for channel to be ready for connect.ch : 
> java.nio.channels.SocketChannel[connection-pending remote=dn:8010]; For more 
> details see:  http://wiki.apache.org/hadoop/SocketTimeout
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
>         at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>         at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>         at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:931)
>         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:866)
>         at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1583)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1511)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1402)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:268)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:142)
>         at com.sun.proxy.$Proxy23.initReplicaRecovery(Unknown Source)
>         at 
> org.apache.hadoop.hdfs.protocolPB.InterDatanodeProtocolTranslatorPB.initReplicaRecovery(InterDatanodeProtocolTranslatorPB.java:83)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker.callInitReplicaRecovery(BlockRecoveryWorker.java:579)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker.access$400(BlockRecoveryWorker.java:57)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$RecoveryTaskContiguous.recover(BlockRecoveryWorker.java:135)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1.run(BlockRecoveryWorker.java:620)
>         at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.hadoop.net.ConnectTimeoutException: 90000 millis 
> timeout while waiting for channel to be ready for connect. ch : 
> java.nio.channels.SocketChannel[connection-pending remote=dn1:8010]
>         at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:607)
>         at 
> org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:662)
>         at 
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:783)
>         at 
> org.apache.hadoop.ipc.Client$Connection.access$3900(Client.java:346)
>         at org.apache.hadoop.ipc.Client.getConnection(Client.java:1653)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1449)
>         ... 10 more
> {code}
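> To put a rough number on the t4 stall (an illustrative sketch, not code from this patch; the helper class below is hypothetical and only reads the public IPC client setting): the IPC client retries a timed-out connection attempt up to ipc.client.connect.max.retries.on.timeouts times (default 45), so with the 90000 ms connect timeout seen above, a recovery pass against the dead dn1 can block for roughly 45 * 90s, about 67 minutes, which matches the 21:59 to 23:05 window in the logs.
> {code:java}
> // Illustrative sketch (hypothetical helper, not part of the patch): estimate how long
> // a single initReplicaRecovery call to a dead datanode can stall under default settings.
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.CommonConfigurationKeysPublic;
> 
> public class RecoveryRetryBudget {
>   public static void main(String[] args) {
>     Configuration conf = new Configuration();
>     // Retry limit applied by the IPC client when a connect attempt times out (default 45).
>     int maxRetries = conf.getInt(
>         CommonConfigurationKeysPublic.IPC_CLIENT_CONNECT_MAX_RETRIES_ON_SOCKET_TIMEOUTS_KEY,
>         CommonConfigurationKeysPublic.IPC_CLIENT_CONNECT_MAX_RETRIES_ON_SOCKET_TIMEOUTS_DEFAULT);
>     long connectTimeoutMs = 90_000L; // the 90000 ms timeout reported in the logs above
>     System.out.println("worst-case stall ~ " + (maxRetries * connectTimeoutMs) / 60_000 + " minutes");
>   }
> }
> {code}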
> *t5.* The user or the NameNode sends new recovery commands within this timeout window.
> Related logs:
> {code:java}
> 2024-03-13 22:13:01.158 WARN hdfs.StateChange DIR* 
> NameSystem.internalReleaseLease: File /xxx/file has not been closed. Lease 
> recovery is in progress. RecoveryId = 28577807097 for block blk_xxx_xxx
> 2024-03-13 22:58:02.701 WARN hdfs.StateChange DIR* 
> NameSystem.internalReleaseLease: File /xxx/file has not been closed. Lease 
> recovery is in progress. RecoveryId = 28578772548 for block blk_xxx_xxx
> {code}
> *t6.* The recovery thread [org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1@54e291ac] still runs replica recovery on dn2, but the final call to commitBlockSynchronization fails because its recovery ID is now smaller than the recovery ID recorded on the NameNode (a sketch of the NameNode-side check follows the stack trace below).
> {code:java}
> 2024-03-13 23:07:05,401 WARN  datanode.DataNode 
> (BlockRecoveryWorker.java:run(623)) 
> [org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1@54e291ac] -
> recover Block: RecoveringBlock{BP-xxx:blk_xxx_xxx; getBlockSize()=0; 
> corrupt=false; offset=-1; locs=[DatanodeInfoWithStorage[dn1:50010,null,null], 
> DatanodeInfoWithStorage[dn2:50010,null,null]]; cachedLocs=[]}
> FAILED: {} org.apache.hadoop.ipc.RemoteException(java.io.IOException): The 
> recovery id 28577373754 does not match current recovery id 28578772548 for 
> block BP-xxx:blk_xxx_xxx
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.commitBlockSynchronization(FSNamesystem.java:4129)
>         at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.commitBlockSynchronization(NameNodeRpcServer.java:1184)
>         at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolServerSideTranslatorPB.commitBlockSynchronization(DatanodeProtocolServerSideTranslatorPB.java:310)
>         at 
> org.apache.hadoop.hdfs.protocol.proto.DatanodeProtocolProtos$DatanodeProtocolService$2.callBlockingMethod(DatanodeProtocolProtos.java:34391)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:635)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:603)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:587)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1137)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1236)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1134)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:2005)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3360)
>         at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1579)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1511)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1402)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:268)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:142)
>         at com.sun.proxy.$Proxy17.commitBlockSynchronization(Unknown Source)
>         at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.commitBlockSynchronization(DatanodeProtocolClientSideTranslatorPB.java:342)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$RecoveryTaskContiguous.syncBlock(BlockRecoveryWorker.java:334)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$RecoveryTaskContiguous.recover(BlockRecoveryWorker.java:189)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1.run(BlockRecoveryWorker.java:620)
>         at java.lang.Thread.run(Thread.java:748)
> {code}
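> For reference, the NameNode-side guard that produces this error looks roughly like the following. This is a paraphrased sketch of the check performed in FSNamesystem#commitBlockSynchronization, with illustrative local variable names, not an exact copy of the source: a commit carrying a recovery ID older than the one currently recorded for the block is rejected, so the recovery dn2 started at t3 can no longer succeed once the newer recovery IDs from t5 have been issued.
> {code:java}
> // Paraphrased sketch, not an exact copy of the Hadoop source.
> // recoveryId:        the ID carried by the datanode's commitBlockSynchronization call (from t3).
> // currentRecoveryId: the ID most recently assigned by the NameNode for this block (from t5).
> if (recoveryId != currentRecoveryId) {
>   throw new IOException("The recovery id " + recoveryId
>       + " does not match current recovery id "
>       + currentRecoveryId + " for block " + block);
> }
> {code}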
> *t7.* Because of the above problem, the user or the NameNode keeps issuing new recovery commands, and every block recovery attempt fails in the same way.
> *t8.* Recovery only succeeds once the abnormal dn1 machine comes back.
> So we should fix this issue by skipping replicas with no live datanode when building the BlockRecoveryCommand, as sketched below, to avoid this kind of recovery failure.
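> One possible shape of the fix is sketched below. It assumes DatanodeManager#getBlockRecoveryCommand can consult DatanodeDescriptor#isAlive for each storage; this is only an illustration of the intended behavior, not necessarily the exact change in the attached pull request. The idea is that when the stale-node filter leaves too few replicas and the code falls back to using all storages, replicas whose datanode is known to be dead are still excluded, so a dead dn1 is never handed to the primary datanode as a recovery target.
> {code:java}
> // Illustrative sketch of the proposed behavior (not necessarily the merged patch):
> // when falling back because too many replicas are stale, keep only live replicas.
> final List<DatanodeStorageInfo> liveLocations = new ArrayList<>(storages.length);
> for (int i = 0; i < storages.length; ++i) {
>   // Assumption: DatanodeDescriptor#isAlive reports whether the node is still registered as live.
>   if (storages[i].getDatanodeDescriptor().isAlive()) {
>     liveLocations.add(storages[i]);
>   }
> }
> if (liveLocations.isEmpty()) {
>   // Last resort: keep the old behavior if no replica is on a live datanode.
>   recoveryInfos = DatanodeStorageInfo.toDatanodeInfos(storages);
> } else {
>   recoveryInfos = DatanodeStorageInfo.toDatanodeInfos(liveLocations);
> }
> {code}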



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
