[ https://issues.apache.org/jira/browse/HDFS-17769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17945089#comment-17945089 ]

ASF GitHub Bot commented on HDFS-17769:
---------------------------------------

hadoop-yetus commented on PR #7602:
URL: https://github.com/apache/hadoop/pull/7602#issuecomment-2809832192

   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |   0m 32s |  |  Docker mode activated.  |
   |||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  1s |  |  No case conflicting files found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to include 1 new or modified test files.  |
   |||| _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  23m 55s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   0m 41s |  |  trunk passed with JDK Ubuntu-11.0.26+4-post-Ubuntu-1ubuntu120.04  |
   | +1 :green_heart: |  compile  |   0m 39s |  |  trunk passed with JDK Private Build-1.8.0_442-8u442-b06~us1-0ubuntu1~20.04-b06  |
   | +1 :green_heart: |  checkstyle  |   0m 36s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   0m 43s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   0m 43s |  |  trunk passed with JDK Ubuntu-11.0.26+4-post-Ubuntu-1ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   1m  7s |  |  trunk passed with JDK Private Build-1.8.0_442-8u442-b06~us1-0ubuntu1~20.04-b06  |
   | +1 :green_heart: |  spotbugs  |   1m 37s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  21m 42s |  |  branch has no errors when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 35s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 36s |  |  the patch passed with JDK Ubuntu-11.0.26+4-post-Ubuntu-1ubuntu120.04  |
   | +1 :green_heart: |  javac  |   0m 36s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 33s |  |  the patch passed with JDK Private Build-1.8.0_442-8u442-b06~us1-0ubuntu1~20.04-b06  |
   | +1 :green_heart: |  javac  |   0m 33s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks issues.  |
   | +1 :green_heart: |  checkstyle  |   0m 29s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   0m 36s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 31s |  |  the patch passed with JDK Ubuntu-11.0.26+4-post-Ubuntu-1ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   0m 59s |  |  the patch passed with JDK Private Build-1.8.0_442-8u442-b06~us1-0ubuntu1~20.04-b06  |
   | +1 :green_heart: |  spotbugs  |   1m 39s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  22m 47s |  |  patch has no errors when building and testing our client artifacts.  |
   |||| _ Other Tests _ |
   | +1 :green_heart: |  unit  |   3m 53s |  |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 23s |  |  The patch does not generate ASF License warnings.  |
   |  |   |  85m  4s |  |  |
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.48 ServerAPI=1.48 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7602/8/artifact/out/Dockerfile |
   | GITHUB PR | https://github.com/apache/hadoop/pull/7602 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 7a5a56b605f0 5.15.0-136-generic #147-Ubuntu SMP Sat Mar 15 15:53:30 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / e266a525fc3fabab83d91d11304ced95ab4beff9 |
   | Default Java | Private Build-1.8.0_442-8u442-b06~us1-0ubuntu1~20.04-b06 |
   | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.26+4-post-Ubuntu-1ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_442-8u442-b06~us1-0ubuntu1~20.04-b06 |
   | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7602/8/testReport/ |
   | Max. process+thread count | 1139 (vs. ulimit of 5500) |
   | modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
   | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7602/8/console |
   | versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
   | Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   




> Allows client to actively retry to Active NameNode when the Observer NameNode is too far behind client state id.
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-17769
>                 URL: https://issues.apache.org/jira/browse/HDFS-17769
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>    Affects Versions: 3.3.4, 3.3.6, 3.4.1
>            Reporter: Guo Wei
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.4.2
>
>         Attachments: 1.png, 2.png, 3.png
>
>
> When we use the Router to forward read requests to the Observer and the cluster experiences heavy write workloads, Observer nodes may fail to keep pace with edit log tailing; this can happen even when the dfs.ha.tail-edits.in-progress parameter is enabled.
> The Observer then rejects reads with "RetriableException: Observer Node is too far behind". When the client's ipc.client.ping parameter is set to true, the client waits and retries indefinitely, so the application cannot obtain the data it needs in time. In that situation the Active NameNode should handle the request instead.
> Here are some of the errors we observed, followed by our repair verification:
> The state id of the Observer is too far behind the Active:
> {code:java}
> Tue Apr 15 11:22:41 CST 2025, Active latest txId: 5698245512, Observer latest txId:5695118653,Observer far behind: 3126859, time takes0s 
> Tue Apr 15 11:22:43 CST 2025, Active latest txId: 5698253145, Observer latest txId:5695118653,Observer far behind: 3134492, time takes0s 
> Tue Apr 15 11:22:45 CST 2025, Active latest txId: 5698260942, Observer latest txId:5695118653,Observer far behind: 3142289, time takes0s 
> Tue Apr 15 11:22:47 CST 2025, Active latest txId: 5698268614, Observer latest txId:5695123653,Observer far behind: 3144961, time takes0s 
> Tue Apr 15 11:22:49 CST 2025, Active latest txId: 5698276490, Observer latest txId:5695123653,Observer far behind: 3152837, time takes0s 
> Tue Apr 15 11:22:51 CST 2025, Active latest txId: 5698284361, Observer latest txId:5695128653,Observer far behind: 3155708, time takes0s 
> Tue Apr 15 11:22:54 CST 2025, Active latest txId: 5698292641, Observer latest txId:5695128653,Observer far behind: 3163988, time takes0s {code}
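> 
> For completeness, lag samples like the ones above can be reproduced by polling each NameNode's NameNodeInfo JMX bean, whose JournalTransactionInfo field carries LastAppliedOrWrittenTxId; the following small monitor is a sketch of that approach. The host names come from our test cluster in the verification below; the HTTP port 9870 is the Hadoop 3 default and an assumption here.
> {code:java}
> import java.net.URI;
> import java.net.http.HttpClient;
> import java.net.http.HttpRequest;
> import java.net.http.HttpResponse;
> import java.util.Date;
> import java.util.regex.Matcher;
> import java.util.regex.Pattern;
> 
> /** Polls Active and Observer txIds over the NameNode JMX servlet. */
> public class ObserverLagMonitor {
>   // JournalTransactionInfo is a JSON string nested inside the JMX reply,
>   // so match the key loosely and grab the first run of digits after it.
>   private static final Pattern TXID =
>       Pattern.compile("LastAppliedOrWrittenTxId\\D*(\\d+)");
> 
>   static long lastTxId(HttpClient http, String nnHttpAddr) throws Exception {
>     HttpRequest req = HttpRequest.newBuilder(URI.create("http://" + nnHttpAddr
>         + "/jmx?qry=Hadoop:service=NameNode,name=NameNodeInfo")).build();
>     Matcher m = TXID.matcher(
>         http.send(req, HttpResponse.BodyHandlers.ofString()).body());
>     if (!m.find()) {
>       throw new IllegalStateException("no txId in JMX reply from " + nnHttpAddr);
>     }
>     return Long.parseLong(m.group(1));
>   }
> 
>   public static void main(String[] args) throws Exception {
>     HttpClient http = HttpClient.newHttpClient();
>     while (true) {
>       long start = System.currentTimeMillis();
>       long active = lastTxId(http, "20w:9870");   // assumed HTTP addresses
>       long observer = lastTxId(http, "22w:9870");
>       System.out.printf("%s, Active latest txId: %d, Observer latest txId:%d,"
>           + "Observer far behind: %d, time takes%ds%n", new Date(), active,
>           observer, active - observer,
>           (System.currentTimeMillis() - start) / 1000);
>       Thread.sleep(2000);
>     }
>   }
> }
> {code}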
>  
> RetriableException:
> The client throws a RetriableException and cannot read through the Router:
> {code:java}
> 10:16:53.744 [IPC Client (24555242) connection to routerIp:8888 from hdfs] DEBUG org.apache.hadoop.ipc.Client - IPC Client (24555242) connection to routerIp:8888 from hdfs: stopped, remaining connections 0 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.RetriableException): Observer Node is too far behind: serverStateId = 5695128653 clientStateId = 5698292641 
>  at sun.reflect.GeneratedConstructorAccessor49.newInstance(Unknown Source) 
>  at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) 
>  at java.lang.reflect.Constructor.newInstance(Constructor.java:423) 
>  at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:121) 
>  at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:110) 
>  at org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invokeMethod(RouterRpcClient.java:505) 
>  at org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invokeSequential(RouterRpcClient.java:972) 
>  at org.apache.hadoop.hdfs.server.federation.router.RouterClientProtocol.getFileInfo(RouterClientProtocol.java:981) 
>  at org.apache.hadoop.hdfs.server.federation.router.RouterRpcServer.getFileInfo(RouterRpcServer.java:883) 
>  at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:1044) 
>  at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) 
>  at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:621) 
>  at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:589) 
>  at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573) 
>  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1227) 
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1106) 
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1029) 
>  at java.security.AccessController.doPrivileged(Native Method) 
>  at javax.security.auth.Subject.doAs(Subject.java:422) 
>  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899) 
>  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3063) 
> Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.RetriableException): Observer Node is too far behind: serverStateId = 5632963133 clientStateId = 5635526176 
>  at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1567) 
>  at org.apache.hadoop.ipc.Client.call(Client.java:1513) 
>  at org.apache.hadoop.ipc.Client.call(Client.java:1410) 
>  at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:258) 
>  at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:139) 
>  at com.sun.proxy.$Proxy19.getFileInfo(Unknown Source) 
>  at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:966) 
>  at sun.reflect.GeneratedMethodAccessor25.invoke(Unknown Source) 
>  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
>  at java.lang.reflect.Method.invoke(Method.java:498) 
>  at org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invoke(RouterRpcClient.java:637) 
>  at org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invoke(RouterRpcClient.java:654) 
>  at org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invoke(RouterRpcClient.java:654) 
>  at org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invoke(RouterRpcClient.java:654) 
>  at org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invoke(RouterRpcClient.java:654) 
>  at org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invoke(RouterRpcClient.java:654) 
>  at org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invoke(RouterRpcClient.java:654) 
>  at org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invokeMethod(RouterRpcClient.java:467) 
>  ... 15 more 
>  
>  at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1584) 
>  at org.apache.hadoop.ipc.Client.call(Client.java:1529) 
>  at org.apache.hadoop.ipc.Client.call(Client.java:1426) 
>  at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:258) 
>  at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:139) 
>  at com.sun.proxy.$Proxy9.getFileInfo(Unknown Source) 
>  at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.lambda$getFileInfo$41(ClientNamenodeProtocolTranslatorPB.java:820) 
>  at org.apache.hadoop.ipc.internal.ShadedProtobufHelper.ipc(ShadedProtobufHelper.java:160) 
>  at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:820) 
>  at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source) 
>  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
>  at java.lang.reflect.Method.invoke(Method.java:498) 
>  at org.apache.hadoop.hdfs.server.namenode.ha.RouterObserverReadProxyProvider$RouterObserverReadInvocationHandler.invoke(RouterObserverReadProxyProvider.java:216) 
>  at com.sun.proxy.$Proxy10.getFileInfo(Unknown Source) 
>  at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source) 
>  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
>  at java.lang.reflect.Method.invoke(Method.java:498) 
>  at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:437) 
>  at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:170) 
>  at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:162) 
>  at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:100) 
>  at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:366) 
>  at com.sun.proxy.$Proxy10.getFileInfo(Unknown Source) 
>  at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1770) 
>  at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1828) 
>  at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1825) 
>  at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) 
>  at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1840) 
>  at org.apache.hadoop.fs.FileUtil.checkDest(FileUtil.java:611) 
>  at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:468) 
>  at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:432) 
>  at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:2592) 
>  at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:2558) 
>  at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:2520) 
>  at hadoop.write_then_observer_read2.main(write_then_observer_read2.java:64) 
> {code}
>  
> Repair verification: 
> {code:java}
> (1) View the status of the cluster NameNodes:
> [root@20w ~]# hdfs haadmin -ns hh-rbf-test5 -getAllServiceState 
> 20w:8020                            active     
> 21w:8020                            standby    
> 22w:8020                            observer  
> (2) Enable the dfs.namenode.observer.too.stale.retry.active.enable parameter and execute a read command on the 21w machine:
> [root@21w ~]# hdfs dfs -cat /t.sh 
> /bin/ssh $1
> (3) The read RPC request appears in hdfs-audit.log on the Active NameNode, so the request was forwarded to the Active NameNode:
> [root@20w ~]# tail -f /data/disk02/var/log/hadoop/hdfs/hdfs-audit.log|grep t.sh 
> 2025-04-15 11:24:31,148 INFO FSNamesystem.audit: allowed=true   ugi=root (auth:SIMPLE)  ip=/xx cmd=getfileinfo src=/t.sh       dst=null        perm=null       proto=rpc 
> 2025-04-15 11:24:31,461 INFO FSNamesystem.audit: allowed=true   ugi=root (auth:SIMPLE)  ip=/xx cmd=open        src=/t.sh       dst=null        perm=null       proto=rpc
> (4) The Observer log shows the retry being redirected to the Active:
> 2025-04-15 11:24:30,148 WARN  namenode.FSNamesystem (GlobalStateIdContext.java:receiveRequestState(163)) - Retrying to Active NameNode, Observer Node is too far behind: serverStateId = 5695393653 clientStateId = 5699337672 {code}
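> 
> For reference, a sketch of the switch used in step (2). Only the property name is confirmed by this verification; we assume it is set in hdfs-site.xml on the NameNodes and that it defaults to false:
> {code:xml}
> <property>
>   <name>dfs.namenode.observer.too.stale.retry.active.enable</name>
>   <value>true</value>
> </property>
> {code}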
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
