[ https://issues.apache.org/jira/browse/HDFS-17769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17945089#comment-17945089 ]
ASF GitHub Bot commented on HDFS-17769:
---------------------------------------

hadoop-yetus commented on PR #7602:
URL: https://github.com/apache/hadoop/pull/7602#issuecomment-2809832192

:confetti_ball: **+1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:--------:|:-------:|
| +0 :ok: | reexec | 0m 32s | | Docker mode activated. |
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 1s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 0s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. |
|||| _ trunk Compile Tests _ |
| +1 :green_heart: | mvninstall | 23m 55s | | trunk passed |
| +1 :green_heart: | compile | 0m 41s | | trunk passed with JDK Ubuntu-11.0.26+4-post-Ubuntu-1ubuntu120.04 |
| +1 :green_heart: | compile | 0m 39s | | trunk passed with JDK Private Build-1.8.0_442-8u442-b06~us1-0ubuntu1~20.04-b06 |
| +1 :green_heart: | checkstyle | 0m 36s | | trunk passed |
| +1 :green_heart: | mvnsite | 0m 43s | | trunk passed |
| +1 :green_heart: | javadoc | 0m 43s | | trunk passed with JDK Ubuntu-11.0.26+4-post-Ubuntu-1ubuntu120.04 |
| +1 :green_heart: | javadoc | 1m 7s | | trunk passed with JDK Private Build-1.8.0_442-8u442-b06~us1-0ubuntu1~20.04-b06 |
| +1 :green_heart: | spotbugs | 1m 37s | | trunk passed |
| +1 :green_heart: | shadedclient | 21m 42s | | branch has no errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 0m 35s | | the patch passed |
| +1 :green_heart: | compile | 0m 36s | | the patch passed with JDK Ubuntu-11.0.26+4-post-Ubuntu-1ubuntu120.04 |
| +1 :green_heart: | javac | 0m 36s | | the patch passed |
| +1 :green_heart: | compile | 0m 33s | | the patch passed with JDK Private Build-1.8.0_442-8u442-b06~us1-0ubuntu1~20.04-b06 |
| +1 :green_heart: | javac | 0m 33s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 :green_heart: | checkstyle | 0m 29s | | the patch passed |
| +1 :green_heart: | mvnsite | 0m 36s | | the patch passed |
| +1 :green_heart: | javadoc | 0m 31s | | the patch passed with JDK Ubuntu-11.0.26+4-post-Ubuntu-1ubuntu120.04 |
| +1 :green_heart: | javadoc | 0m 59s | | the patch passed with JDK Private Build-1.8.0_442-8u442-b06~us1-0ubuntu1~20.04-b06 |
| +1 :green_heart: | spotbugs | 1m 39s | | the patch passed |
| +1 :green_heart: | shadedclient | 22m 47s | | patch has no errors when building and testing our client artifacts. |
|||| _ Other Tests _ |
| +1 :green_heart: | unit | 3m 53s | | hadoop-hdfs in the patch passed. |
| +1 :green_heart: | asflicense | 0m 23s | | The patch does not generate ASF License warnings. |
| | | | 85m 4s | | |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.48 ServerAPI=1.48 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7602/8/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/7602 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux 7a5a56b605f0 5.15.0-136-generic #147-Ubuntu SMP Sat Mar 15 15:53:30 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / e266a525fc3fabab83d91d11304ced95ab4beff9 |
| Default Java | Private Build-1.8.0_442-8u442-b06~us1-0ubuntu1~20.04-b06 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.26+4-post-Ubuntu-1ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_442-8u442-b06~us1-0ubuntu1~20.04-b06 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7602/8/testReport/ |
| Max. process+thread count | 1139 (vs. ulimit of 5500) |
| modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7602/8/console |
| versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
| Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |

This message was automatically generated.

> Allows client to actively retry to Active NameNode when the Observer NameNode
> is too far behind client state id.
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-17769
>                 URL: https://issues.apache.org/jira/browse/HDFS-17769
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>    Affects Versions: 3.3.4, 3.3.6, 3.4.1
>            Reporter: Guo Wei
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.4.2
>
>         Attachments: 1.png, 2.png, 3.png
>
> When we use the Router to forward read requests to the Observer and the
> cluster experiences heavy write workloads, Observer nodes may fail to keep
> pace with edit log synchronization; this can happen even when the
> dfs.ha.tail-edits.in-progress parameter is configured.
> This triggers "RetriableException: Observer Node is too far behind" errors.
> In particular, when the client's ipc.client.ping parameter is set to true,
> the client keeps waiting and retrying, which can prevent the business from
> obtaining the desired data in a timely manner. In this situation, we should
> consider having the Active NameNode handle the request.
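The behavior requested above comes down to one server-side decision on the Observer: when a read's clientStateId is further ahead of the Observer's serverStateId than some threshold, either keep asking the client to retry on the Observer (the current RetriableException path) or, with the new switch enabled, send the client to the Active NameNode. Below is a minimal, illustrative sketch of that decision; the class name ObserverStalenessPolicy, the threshold field, and the Decision enum are assumptions made for this write-up, not the actual patch code.

{code:java}
// Illustrative sketch only -- class name, constant, and threshold handling are
// assumptions for this write-up, not the HDFS-17769 patch itself.
public final class ObserverStalenessPolicy {

  /** The switch exercised in the verification below (illustrative constant). */
  public static final String RETRY_ACTIVE_KEY =
      "dfs.namenode.observer.too.stale.retry.active.enable";

  private final boolean retryActiveEnabled;
  private final long maxLagTxns; // how many transactions the Observer may lag

  public ObserverStalenessPolicy(boolean retryActiveEnabled, long maxLagTxns) {
    this.retryActiveEnabled = retryActiveEnabled;
    this.maxLagTxns = maxLagTxns;
  }

  /** What the Observer should do for a read whose clientStateId is ahead of serverStateId. */
  public Decision onRead(long serverStateId, long clientStateId) {
    long lag = clientStateId - serverStateId;
    if (lag <= maxLagTxns) {
      // Close enough: wait for edit tailing to catch up and serve the read here.
      return Decision.WAIT_AND_SERVE;
    }
    // Too far behind: either redirect the client to the Active NameNode (new
    // behavior, when the switch is on) or keep the old behavior of asking the
    // client to retry here, which surfaces as
    // RetriableException("Observer Node is too far behind ...").
    return retryActiveEnabled ? Decision.RETRY_ON_ACTIVE : Decision.RETRY_HERE;
  }

  public enum Decision { WAIT_AND_SERVE, RETRY_ON_ACTIVE, RETRY_HERE }
}
{code}

The Observer log line in verification step (4) below, emitted from GlobalStateIdContext.receiveRequestState, corresponds to the RETRY_ON_ACTIVE branch of such a decision.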
> Here are some of our errors and repair verification:
> The state id of the Observer is too far behind the Active:
> {code:java}
> // code placeholder
> Tue Apr 15 11:22:41 CST 2025, Active latest txId: 5698245512, Observer latest txId: 5695118653, Observer far behind: 3126859, time takes 0s
> Tue Apr 15 11:22:43 CST 2025, Active latest txId: 5698253145, Observer latest txId: 5695118653, Observer far behind: 3134492, time takes 0s
> Tue Apr 15 11:22:45 CST 2025, Active latest txId: 5698260942, Observer latest txId: 5695118653, Observer far behind: 3142289, time takes 0s
> Tue Apr 15 11:22:47 CST 2025, Active latest txId: 5698268614, Observer latest txId: 5695123653, Observer far behind: 3144961, time takes 0s
> Tue Apr 15 11:22:49 CST 2025, Active latest txId: 5698276490, Observer latest txId: 5695123653, Observer far behind: 3152837, time takes 0s
> Tue Apr 15 11:22:51 CST 2025, Active latest txId: 5698284361, Observer latest txId: 5695128653, Observer far behind: 3155708, time takes 0s
> Tue Apr 15 11:22:54 CST 2025, Active latest txId: 5698292641, Observer latest txId: 5695128653, Observer far behind: 3163988, time takes 0s
> {code}
>
> RetriableException:
> The client throws a RetriableException and cannot read through the Router:
> {code:java}
> // code placeholder
> 10:16:53.744 [IPC Client (24555242) connection to routerIp:8888 from hdfs] DEBUG org.apache.hadoop.ipc.Client - IPC Client (24555242) connection to routerIp:8888 from hdfs: stopped, remaining connections 0
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.RetriableException): Observer Node is too far behind: serverStateId = 5695128653 clientStateId = 5698292641
>     at sun.reflect.GeneratedConstructorAccessor49.newInstance(Unknown Source)
>     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>     at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>     at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:121)
>     at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:110)
>     at org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invokeMethod(RouterRpcClient.java:505)
>     at org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invokeSequential(RouterRpcClient.java:972)
>     at org.apache.hadoop.hdfs.server.federation.router.RouterClientProtocol.getFileInfo(RouterClientProtocol.java:981)
>     at org.apache.hadoop.hdfs.server.federation.router.RouterRpcServer.getFileInfo(RouterRpcServer.java:883)
>     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:1044)
>     at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:621)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:589)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1227)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1106)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1029)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3063)
> Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.RetriableException): Observer Node is too far behind: serverStateId = 5632963133 clientStateId = 5635526176
>     at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1567)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1513)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1410)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:258)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:139)
>     at com.sun.proxy.$Proxy19.getFileInfo(Unknown Source)
>     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:966)
>     at sun.reflect.GeneratedMethodAccessor25.invoke(Unknown Source)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invoke(RouterRpcClient.java:637)
>     at org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invoke(RouterRpcClient.java:654)
>     at org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invoke(RouterRpcClient.java:654)
>     at org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invoke(RouterRpcClient.java:654)
>     at org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invoke(RouterRpcClient.java:654)
>     at org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invoke(RouterRpcClient.java:654)
>     at org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invoke(RouterRpcClient.java:654)
>     at org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invokeMethod(RouterRpcClient.java:467)
>     ... 15 more
>
>     at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1584)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1529)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1426)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:258)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:139)
>     at com.sun.proxy.$Proxy9.getFileInfo(Unknown Source)
>     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.lambda$getFileInfo$41(ClientNamenodeProtocolTranslatorPB.java:820)
>     at org.apache.hadoop.ipc.internal.ShadedProtobufHelper.ipc(ShadedProtobufHelper.java:160)
>     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:820)
>     at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at org.apache.hadoop.hdfs.server.namenode.ha.RouterObserverReadProxyProvider$RouterObserverReadInvocationHandler.invoke(RouterObserverReadProxyProvider.java:216)
>     at com.sun.proxy.$Proxy10.getFileInfo(Unknown Source)
>     at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:437)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:170)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:162)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:100)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:366)
>     at com.sun.proxy.$Proxy10.getFileInfo(Unknown Source)
>     at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1770)
>     at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1828)
>     at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1825)
>     at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>     at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1840)
>     at org.apache.hadoop.fs.FileUtil.checkDest(FileUtil.java:611)
>     at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:468)
>     at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:432)
>     at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:2592)
>     at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:2558)
>     at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:2520)
>     at hadoop.write_then_observer_read2.main(write_then_observer_read2.java:64)
> {code}
>
> Repair verification:
> {code:java}
> // code placeholder
> (1) View the status of the cluster NameNodes:
> [root@20w ~]# hdfs haadmin -ns hh-rbf-test5 -getAllServiceState
> 20w:8020 active
> 21w:8020 standby
> 22w:8020 observer
> (2) We enable the dfs.namenode.observer.too.stale.retry.active.enable parameter and execute a read command on the 21w machine:
> [root@21w ~]# hdfs dfs -cat /t.sh
> /bin/ssh $1
> (3) The read RPC request can be found in hdfs-audit.log on the Active NameNode, so the request was forwarded to the Active NameNode:
> [root@20w ~]# tail -f /data/disk02/var/log/hadoop/hdfs/hdfs-audit.log | grep t.sh
> 2025-04-15 11:24:31,148 INFO FSNamesystem.audit: allowed=true ugi=root (auth:SIMPLE) ip=/xx cmd=getfileinfo src=/t.sh dst=null perm=null proto=rpc
> 2025-04-15 11:24:31,461 INFO FSNamesystem.audit: allowed=true ugi=root (auth:SIMPLE) ip=/xx cmd=open src=/t.sh dst=null perm=null proto=rpc
> (4) There are logs of retries to the Active NameNode in the Observer log:
> 2025-04-15 11:24:30,148 WARN namenode.FSNamesystem (GlobalStateIdContext.java:receiveRequestState(163)) - Retrying to Active NameNode, Observer Node is too far behind: serverStateId = 5695393653 clientStateId = 5699337672
> {code}
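As a small usage note for verification step (2) above, the sketch below toggles and reads the same switch through Hadoop's Configuration API; in the verification it would instead be set in hdfs-site.xml on the NameNodes, and the default value used here (false) is an assumption for illustration.

{code:java}
// Minimal sketch: setting and reading the switch used in verification step (2).
// The default value (false) is an assumption for illustration.
import org.apache.hadoop.conf.Configuration;

public class RetryActiveFlagCheck {
  public static void main(String[] args) {
    Configuration conf = new Configuration();

    // Equivalent to setting the property in hdfs-site.xml on the Observer NameNode.
    conf.setBoolean("dfs.namenode.observer.too.stale.retry.active.enable", true);

    boolean retryOnActive =
        conf.getBoolean("dfs.namenode.observer.too.stale.retry.active.enable", false);
    System.out.println("Retry on Active when Observer is too stale: " + retryOnActive);
  }
}
{code}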