[ https://issues.apache.org/jira/browse/HDFS-17769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17945033#comment-17945033 ]

ASF GitHub Bot commented on HDFS-17769:
---------------------------------------

BsoBird commented on code in PR #7602:
URL: https://github.com/apache/hadoop/pull/7602#discussion_r2046629911


##########
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/GlobalStateIdContext.java:
##########
@@ -156,9 +157,11 @@ public long receiveRequestState(RpcRequestHeaderProto header,
         ESTIMATED_TRANSACTIONS_PER_SECOND
             * TimeUnit.MILLISECONDS.toSeconds(clientWaitTime)
             * ESTIMATED_SERVER_TIME_MULTIPLIER) {
-      throw new RetriableException(
-          "Observer Node is too far behind: serverStateId = "
-              + serverStateId + " clientStateId = " + clientStateId);
+      String message = "Retrying to Active NameNode, Observer Node is too"
+          + " far behind: serverStateId = " + serverStateId
+          + " clientStateId = " + clientStateId;
+      FSNamesystem.LOG.warn(message);

Review Comment:
   Is it really necessary to log here?
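
   For context, the quoted hunk is truncated: after the warn, the branch presumably ends by throwing an exception that the client-side proxy provider treats as "send this call to the Active NameNode" instead of the old RetriableException. A minimal sketch of how the whole branch could look, assuming the behavior is gated by a flag (the field name retryActiveWhenObserverTooStale is made up here) and reuses org.apache.hadoop.ipc.ObserverRetryOnActiveException; this is not the exact patch:

   ```java
   String message = "Retrying to Active NameNode, Observer Node is too"
       + " far behind: serverStateId = " + serverStateId
       + " clientStateId = " + clientStateId;
   if (retryActiveWhenObserverTooStale) {  // made-up flag name, bound to the new config key
     FSNamesystem.LOG.warn(message);
     // StandbyException subtype that the observer-read proxy providers treat as
     // "stop retrying this Observer and redirect the call to the Active NameNode".
     throw new ObserverRetryOnActiveException(message);
   }
   // Previous behavior: ask the client to retry the same Observer later.
   throw new RetriableException(
       "Observer Node is too far behind: serverStateId = "
           + serverStateId + " clientStateId = " + clientStateId);
   ```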





> Allows client to actively retry to Active NameNode when the Observer NameNode 
> is too far behind client state id.
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-17769
>                 URL: https://issues.apache.org/jira/browse/HDFS-17769
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>    Affects Versions: 3.3.4, 3.3.6, 3.4.1
>            Reporter: Guo Wei
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.4.2
>
>         Attachments: 1.png, 2.png, 3.png
>
>
> When we use the Router to forward read requests to the Observer and the 
> cluster experiences heavy write workloads, the Observer may fail to keep pace 
> with edit log synchronization; even with the dfs.ha.tail-edits.in-progress 
> parameter configured, it can still fall behind.
> This triggers "RetriableException: Observer Node is too far behind" errors. 
> Especially when the client's ipc.client.ping parameter is set to true, the 
> client keeps waiting and retrying, so the application cannot obtain the 
> desired data in time. In this situation we should consider letting the Active 
> NameNode handle the request instead.
> Here are some of the errors we observed and the verification of the repair:
> The Observer's state id is far behind the Active's:
> {code:java}
> // code placeholder
> Tue Apr 15 11:22:41 CST 2025, Active latest txId: 5698245512, Observer latest txId:5695118653,Observer far behind: 3126859, time takes0s 
> Tue Apr 15 11:22:43 CST 2025, Active latest txId: 5698253145, Observer latest txId:5695118653,Observer far behind: 3134492, time takes0s 
> Tue Apr 15 11:22:45 CST 2025, Active latest txId: 5698260942, Observer latest txId:5695118653,Observer far behind: 3142289, time takes0s 
> Tue Apr 15 11:22:47 CST 2025, Active latest txId: 5698268614, Observer latest txId:5695123653,Observer far behind: 3144961, time takes0s 
> Tue Apr 15 11:22:49 CST 2025, Active latest txId: 5698276490, Observer latest txId:5695123653,Observer far behind: 3152837, time takes0s 
> Tue Apr 15 11:22:51 CST 2025, Active latest txId: 5698284361, Observer latest txId:5695128653,Observer far behind: 3155708, time takes0s 
> Tue Apr 15 11:22:54 CST 2025, Active latest txId: 5698292641, Observer latest txId:5695128653,Observer far behind: 3163988, time takes0s 
> {code}
>  
> RetriableException:
> The client receives a RetriableException and cannot complete the read through 
> the Router:
> {code:java}
> // code placeholder
> 10:16:53.744 [IPC Client (24555242) connection to routerIp:8888 from hdfs] DEBUG org.apache.hadoop.ipc.Client - IPC Client (24555242) connection to routerIp:8888 from hdfs: stopped, remaining connections 0
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.RetriableException): Observer Node is too far behind: serverStateId = 5695128653 clientStateId = 5698292641
>  at sun.reflect.GeneratedConstructorAccessor49.newInstance(Unknown Source)
>  at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>  at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>  at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:121)
>  at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:110)
>  at org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invokeMethod(RouterRpcClient.java:505)
>  at org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invokeSequential(RouterRpcClient.java:972)
>  at org.apache.hadoop.hdfs.server.federation.router.RouterClientProtocol.getFileInfo(RouterClientProtocol.java:981)
>  at org.apache.hadoop.hdfs.server.federation.router.RouterRpcServer.getFileInfo(RouterRpcServer.java:883)
>  at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:1044)
>  at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>  at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:621)
>  at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:589)
>  at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573)
>  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1227)
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1106)
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1029)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:422)
>  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
>  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3063)
> Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.RetriableException): Observer Node is too far behind: serverStateId = 5632963133 clientStateId = 5635526176
>  at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1567)
>  at org.apache.hadoop.ipc.Client.call(Client.java:1513)
>  at org.apache.hadoop.ipc.Client.call(Client.java:1410)
>  at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:258)
>  at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:139)
>  at com.sun.proxy.$Proxy19.getFileInfo(Unknown Source)
>  at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:966)
>  at sun.reflect.GeneratedMethodAccessor25.invoke(Unknown Source)
>  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invoke(RouterRpcClient.java:637)
>  at org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invoke(RouterRpcClient.java:654)
>  at org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invoke(RouterRpcClient.java:654)
>  at org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invoke(RouterRpcClient.java:654)
>  at org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invoke(RouterRpcClient.java:654)
>  at org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invoke(RouterRpcClient.java:654)
>  at org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invoke(RouterRpcClient.java:654)
>  at org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invokeMethod(RouterRpcClient.java:467)
>  ... 15 more
>  
>  at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1584)
>  at org.apache.hadoop.ipc.Client.call(Client.java:1529)
>  at org.apache.hadoop.ipc.Client.call(Client.java:1426)
>  at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:258)
>  at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:139)
>  at com.sun.proxy.$Proxy9.getFileInfo(Unknown Source)
>  at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.lambda$getFileInfo$41(ClientNamenodeProtocolTranslatorPB.java:820)
>  at org.apache.hadoop.ipc.internal.ShadedProtobufHelper.ipc(ShadedProtobufHelper.java:160)
>  at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:820)
>  at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
>  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at org.apache.hadoop.hdfs.server.namenode.ha.RouterObserverReadProxyProvider$RouterObserverReadInvocationHandler.invoke(RouterObserverReadProxyProvider.java:216)
>  at com.sun.proxy.$Proxy10.getFileInfo(Unknown Source)
>  at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
>  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:437)
>  at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:170)
>  at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:162)
>  at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:100)
>  at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:366)
>  at com.sun.proxy.$Proxy10.getFileInfo(Unknown Source)
>  at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1770)
>  at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1828)
>  at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1825)
>  at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>  at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1840)
>  at org.apache.hadoop.fs.FileUtil.checkDest(FileUtil.java:611)
>  at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:468)
>  at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:432)
>  at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:2592)
>  at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:2558)
>  at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:2520)
>  at hadoop.write_then_observer_read2.main(write_then_observer_read2.java:64)
> {code}
>  
> Repair verification:
> {code:java}
> // code placeholder
> (1) View the state of the cluster's NameNodes:
> [root@20w ~]# hdfs haadmin -ns hh-rbf-test5 -getAllServiceState 
> 20w:8020                            active     
> 21w:8020                            standby    
> 22w:8020                            observer  
> (2) Enable the dfs.namenode.observer.too.stale.retry.active.enable parameter and execute a read command on the 21w machine:
> [root@21w ~]# hdfs dfs -cat /t.sh 
> /bin/ssh $1
> (3) The read RPC request shows up in hdfs-audit.log on the Active NameNode, so the request was forwarded to the Active NameNode:
> [root@20w ~]# tail -f /data/disk02/var/log/hadoop/hdfs/hdfs-audit.log|grep t.sh 
> 2025-04-15 11:24:31,148 INFO FSNamesystem.audit: allowed=true   ugi=root (auth:SIMPLE)  ip=/xx cmd=getfileinfo src=/t.sh       dst=null        perm=null       proto=rpc 
> 2025-04-15 11:24:31,461 INFO FSNamesystem.audit: allowed=true   ugi=root (auth:SIMPLE)  ip=/xx cmd=open        src=/t.sh       dst=null        perm=null       proto=rpc
> (4) The Observer log shows the retry-to-Active entries:
> 2025-04-15 11:24:30,148 WARN  namenode.FSNamesystem (GlobalStateIdContext.java:receiveRequestState(163)) - Retrying to Active NameNode, Observer Node is too far behind: serverStateId = 5695393653 clientStateId = 5699337672
> {code}
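>
> For completeness, the switch used in step (2) is a NameNode-side setting. In a test it could be expressed programmatically roughly as below; the property name is taken from step (2), and the false default is an assumption:
> {code:java}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.hdfs.HdfsConfiguration;
>
> public class EnableObserverRetryActive {
>   public static void main(String[] args) {
>     // In production this would normally be set in hdfs-site.xml on the NameNodes.
>     Configuration conf = new HdfsConfiguration();
>     conf.setBoolean("dfs.namenode.observer.too.stale.retry.active.enable", true);
>     System.out.println(conf.getBoolean(
>         "dfs.namenode.observer.too.stale.retry.active.enable", false));
>   }
> }
> {code}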
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
