xinglin opened a new pull request, #5700:
URL: https://github.com/apache/hadoop/pull/5700

   ### Description of PR
   Added support to fail fast when detecting an unreachable/unresponsive standby NN in ObserverReadProxyProvider.
   
   ### How was this patch tested?
   * Unit tests
   ```
   ~/p/h/t/hadoop-hdfs-project (HDFS-17030)> mvn test -Dtest="TestObserverReadProxyProvider.java"
   [INFO] -------------------------------------------------------
   [INFO]  T E S T S
   [INFO] -------------------------------------------------------
   [INFO] Running org.apache.hadoop.hdfs.server.namenode.ha.TestObserverReadProxyProvider
   [INFO] Tests run: 13, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.136 s - in org.apache.hadoop.hdfs.server.namenode.ha.TestObserverReadProxyProvider
   [INFO]
   [INFO] Results:
   [INFO]
   [INFO] Tests run: 13, Failures: 0, Errors: 0, Skipped: 0
   ```
   * Tested in a testing cluster
       + We took a heap dump of a standby NN with `jmap -F`, which leaves the NN unresponsive while the dump is in progress.
           ```
           bash-4.2$ jmap -F -dump:format=b,file=heapdump-25801.hprof 25801
           Attaching to process ID 25801, please wait...
           Debugger attached successfully.
           Server compiler detected.
           JVM version is 25.172-b11
           Dumping heap to heapdump-25801.hprof ... 
           ```
   
       + The existing Hadoop binary took more than 2 minutes to complete the list operation, because _ipc.client.rpc-timeout.ms_ is set to 2 minutes.
           ```
           [xinglin@ltx1-hcl14866 ~]$ time hdfs dfs -ls /tmp/testFile.txt
           23/05/24 23:07:05 INFO fs.FileBasedMountTableLoader: TID: 1 - Loading mount table from hdfs://ltx1-yugiohnn01.grid.linkedin.com:9000/mounttable/linkfs/ltx1-yugioh-router-mountpoints.json.
           -rw-r--r--   3 xinglin user      15841 2023-05-18 22:18 /tmp/testFile.txt
           real 2m4.161s
           user 0m5.052s
           sys  0m0.322s
           ```
       + The new binary completed the list operation in under 10 seconds (7.4 s real time).
           ```
           [xinglin@ltx1-hcl14866 hadoop-bin_2100506]$ time hdfs dfs -ls /tmp/testFile.txt 2>log2.txt 1>&1
           -rw-r--r--   3 xinglin user      15841 2023-05-18 22:18 /tmp/testFile.txt
           real 0m7.399s
           user 0m5.091s
           sys  0m0.274s
           ```
       + Relevant log lines. Note the 5-second delay (23:07:12 -> 23:07:17) before the probe of the unreachable NN is cancelled on timeout.
           ```
           23/05/24 23:07:12 DEBUG ha.ObserverReadProxyProvider: HA State for ltx1-yugiohnn01-ha1.grid.linkedin.com/10.150.1.132:9000 is active
           23/05/24 23:07:12 DEBUG ha.ObserverReadProxyProvider: Changed current proxy from none to ltx1-yugiohnn01-ha1.grid.linkedin.com/10.150.1.132:9000
           23/05/24 23:07:12 DEBUG ha.ObserverReadProxyProvider: Skipping proxy ltx1-yugiohnn01-ha1.grid.linkedin.com/10.150.1.132:9000 for getBlockLocations because it is in state active
           23/05/24 23:07:12 DEBUG ha.ObserverReadProxyProvider: HA State for ltx1-yugiohnn01-ha2.grid.linkedin.com/10.150.1.133:9000 is standby
           23/05/24 23:07:12 DEBUG ha.ObserverReadProxyProvider: Changed current proxy from ltx1-yugiohnn01-ha1.grid.linkedin.com/10.150.1.132:9000 to ltx1-yugiohnn01-ha2.grid.linkedin.com/10.150.1.133:9000
           23/05/24 23:07:12 DEBUG ha.ObserverReadProxyProvider: Skipping proxy ltx1-yugiohnn01-ha2.grid.linkedin.com/10.150.1.133:9000 for getBlockLocations because it is in state standby
           23/05/24 23:07:17 DEBUG ha.ObserverReadProxyProvider: Cancel NN probe task due to timeout for ltx1-yugiohnn01-ha3.grid.linkedin.com/10.150.1.245:9000
           23/05/24 23:07:17 DEBUG ha.ObserverReadProxyProvider: Changed current proxy from ltx1-yugiohnn01-ha2.grid.linkedin.com/10.150.1.133:9000 to ltx1-yugiohnn01-ha3.grid.linkedin.com/10.150.1.245:9000
           23/05/24 23:07:17 DEBUG ha.ObserverReadProxyProvider: Skipping proxy ltx1-yugiohnn01-ha3.grid.linkedin.com/10.150.1.245:9000 for getBlockLocations because it is in state unreachable
           23/05/24 23:07:17 DEBUG ha.ObserverReadProxyProvider: HA State for ltx1-yugiohnn01-ha4.grid.linkedin.com/10.150.1.147:9000 is observer
           23/05/24 23:07:17 DEBUG ha.ObserverReadProxyProvider: Changed current proxy from ltx1-yugiohnn01-ha3.grid.linkedin.com/10.150.1.245:9000 to ltx1-yugiohnn01-ha4.grid.linkedin.com/10.150.1.147:9000
           23/05/24 23:07:17 DEBUG ha.ObserverReadProxyProvider: Attempting to service getBlockLocations using proxy ltx1-yugiohnn01-ha4.grid.linkedin.com/10.150.1.147:9000
           23/05/24 23:07:17 DEBUG ha.ObserverReadProxyProvider: Invocation of getBlockLocations using ltx1-yugiohnn01-ha4.grid.linkedin.com/10.150.1.147:9000 was successful
           23/05/24 23:07:17 DEBUG ha.ObserverReadProxyProvider: Attempting to service getServerDefaults using proxy ltx1-yugiohnn01-ha4.grid.linkedin.com/10.150.1.147:9000
           23/05/24 23:07:17 DEBUG ha.ObserverReadProxyProvider: Invocation of getServerDefaults using ltx1-yugiohnn01-ha4.grid.linkedin.com/10.150.1.147:9000 was successful
           ```
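
The "Cancel NN probe task due to timeout" line in the log suggests that the HA-state probe runs as a cancellable task with a bounded wait. The following is a minimal, self-contained sketch of that general pattern and is not the code in this patch: the class `NnProbeSketch`, the enum `ProbeState`, and the method `probeWithTimeout` are hypothetical names introduced for illustration, and the probe itself is simulated with a plain `Callable` rather than a real NameNode RPC.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class NnProbeSketch {
  /** Hypothetical state enum for this sketch; not Hadoop's own HA state type. */
  enum ProbeState { ACTIVE, STANDBY, OBSERVER, UNREACHABLE }

  /**
   * Runs the probe on the given pool and waits at most timeoutMs for a result.
   * If the probe does not answer in time, the task is cancelled and the node
   * is reported as UNREACHABLE so the caller can move on to the next node.
   */
  static ProbeState probeWithTimeout(ExecutorService pool,
                                     Callable<ProbeState> probe,
                                     long timeoutMs) {
    Future<ProbeState> task = pool.submit(probe);
    try {
      return task.get(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      // Fail fast: stop waiting and interrupt the probe thread.
      task.cancel(true);
      return ProbeState.UNREACHABLE;
    } catch (ExecutionException e) {
      // The probe itself threw; treat the node as unreachable in this sketch.
      return ProbeState.UNREACHABLE;
    } catch (InterruptedException e) {
      // Preserve the interrupt status for the caller.
      Thread.currentThread().interrupt();
      task.cancel(true);
      return ProbeState.UNREACHABLE;
    }
  }

  public static void main(String[] args) {
    ExecutorService pool = Executors.newCachedThreadPool();
    try {
      // A responsive node answers immediately.
      ProbeState healthy =
          probeWithTimeout(pool, () -> ProbeState.OBSERVER, 2_000);
      // A hung node (simulated with sleep) exceeds the timeout.
      ProbeState hung = probeWithTimeout(pool, () -> {
        Thread.sleep(5_000);
        return ProbeState.STANDBY;
      }, 200);
      System.out.println(healthy + " " + hung);
    } finally {
      pool.shutdownNow();
    }
  }
}
```

With a bounded wait like this, the cost of an unresponsive node is capped by the probe timeout instead of the full RPC timeout, which matches the shape of the measurements above (seconds rather than minutes).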
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

