xinglin opened a new pull request, #5700:
URL: https://github.com/apache/hadoop/pull/5700
### Description of PR
Added support for failing fast when an unreachable or unresponsive standby
NN is detected in ObserverReadProxyProvider.
### How was this patch tested?
* Unit tests
```
~/p/h/t/hadoop-hdfs-project (HDFS-17030)> mvn test -Dtest="TestObserverReadProxyProvider.java"
[INFO] -------------------------------------------------------
[INFO] T E S T S
[INFO] -------------------------------------------------------
[INFO] Running org.apache.hadoop.hdfs.server.namenode.ha.TestObserverReadProxyProvider
[INFO] Tests run: 13, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.136 s - in org.apache.hadoop.hdfs.server.namenode.ha.TestObserverReadProxyProvider
[INFO]
[INFO] Results:
[INFO]
[INFO] Tests run: 13, Failures: 0, Errors: 0, Skipped: 0
```
* Tested on a test cluster
+ We took a heap dump of a standby NN, which leaves it unresponsive while the dump is in progress.
```
bash-4.2$ jmap -F -dump:format=b,file=heapdump-25801.hprof 25801
Attaching to process ID 25801, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 25.172-b11
Dumping heap to heapdump-25801.hprof ...
```
+ The existing Hadoop binary took more than 2 minutes to complete the list
operation, because _ipc.client.rpc-timeout.ms_ is set to 2 minutes.
```
[xinglin@ltx1-hcl14866 ~]$ time hdfs dfs -ls /tmp/testFile.txt
23/05/24 23:07:05 INFO fs.FileBasedMountTableLoader: TID: 1 - Loading mount table from hdfs://ltx1-yugiohnn01.grid.linkedin.com:9000/mounttable/linkfs/ltx1-yugioh-router-mountpoints.json.
-rw-r--r-- 3 xinglin user 15841 2023-05-18 22:18 /tmp/testFile.txt
real 2m4.161s
user 0m5.052s
sys 0m0.322s
```
+ The new binary completed the same list operation in under 10 seconds.
```
[xinglin@ltx1-hcl14866 hadoop-bin_2100506]$ time hdfs dfs -ls /tmp/testFile.txt 2>log2.txt 1>&1
-rw-r--r-- 3 xinglin user 15841 2023-05-18 22:18 /tmp/testFile.txt
real 0m7.399s
user 0m5.091s
sys 0m0.274s
```
+ Relevant log lines. Note the 5-second delay (23:07:12 -> 23:07:17) while the probe of the unresponsive NN (ha3) timed out.
```
23/05/24 23:07:12 DEBUG ha.ObserverReadProxyProvider: HA State for ltx1-yugiohnn01-ha1.grid.linkedin.com/10.150.1.132:9000 is active
23/05/24 23:07:12 DEBUG ha.ObserverReadProxyProvider: Changed current proxy from none to ltx1-yugiohnn01-ha1.grid.linkedin.com/10.150.1.132:9000
23/05/24 23:07:12 DEBUG ha.ObserverReadProxyProvider: Skipping proxy ltx1-yugiohnn01-ha1.grid.linkedin.com/10.150.1.132:9000 for getBlockLocations because it is in state active
23/05/24 23:07:12 DEBUG ha.ObserverReadProxyProvider: HA State for ltx1-yugiohnn01-ha2.grid.linkedin.com/10.150.1.133:9000 is standby
23/05/24 23:07:12 DEBUG ha.ObserverReadProxyProvider: Changed current proxy from ltx1-yugiohnn01-ha1.grid.linkedin.com/10.150.1.132:9000 to ltx1-yugiohnn01-ha2.grid.linkedin.com/10.150.1.133:9000
23/05/24 23:07:12 DEBUG ha.ObserverReadProxyProvider: Skipping proxy ltx1-yugiohnn01-ha2.grid.linkedin.com/10.150.1.133:9000 for getBlockLocations because it is in state standby
23/05/24 23:07:17 DEBUG ha.ObserverReadProxyProvider: Cancel NN probe task due to timeout for ltx1-yugiohnn01-ha3.grid.linkedin.com/10.150.1.245:9000
23/05/24 23:07:17 DEBUG ha.ObserverReadProxyProvider: Changed current proxy from ltx1-yugiohnn01-ha2.grid.linkedin.com/10.150.1.133:9000 to ltx1-yugiohnn01-ha3.grid.linkedin.com/10.150.1.245:9000
23/05/24 23:07:17 DEBUG ha.ObserverReadProxyProvider: Skipping proxy ltx1-yugiohnn01-ha3.grid.linkedin.com/10.150.1.245:9000 for getBlockLocations because it is in state unreachable
23/05/24 23:07:17 DEBUG ha.ObserverReadProxyProvider: HA State for ltx1-yugiohnn01-ha4.grid.linkedin.com/10.150.1.147:9000 is observer
23/05/24 23:07:17 DEBUG ha.ObserverReadProxyProvider: Changed current proxy from ltx1-yugiohnn01-ha3.grid.linkedin.com/10.150.1.245:9000 to ltx1-yugiohnn01-ha4.grid.linkedin.com/10.150.1.147:9000
23/05/24 23:07:17 DEBUG ha.ObserverReadProxyProvider: Attempting to service getBlockLocations using proxy ltx1-yugiohnn01-ha4.grid.linkedin.com/10.150.1.147:9000
23/05/24 23:07:17 DEBUG ha.ObserverReadProxyProvider: Invocation of getBlockLocations using ltx1-yugiohnn01-ha4.grid.linkedin.com/10.150.1.147:9000 was successful
23/05/24 23:07:17 DEBUG ha.ObserverReadProxyProvider: Attempting to service getServerDefaults using proxy ltx1-yugiohnn01-ha4.grid.linkedin.com/10.150.1.147:9000
23/05/24 23:07:17 DEBUG ha.ObserverReadProxyProvider: Invocation of getServerDefaults using ltx1-yugiohnn01-ha4.grid.linkedin.com/10.150.1.147:9000 was successful
```
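
For readers who have not opened the diff, the behaviour visible in the log (the probe of ha3 being cancelled after roughly 5 seconds and the NN treated as unreachable) can be illustrated with a small standalone sketch. This is not the code in this patch: the class `NnProbeSketch`, the `State` enum, and the method `probeWithTimeout` are hypothetical names chosen for illustration, and the sketch only assumes that the state probe runs on a separate thread and is bounded by a timeout.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class NnProbeSketch {
  /** Hypothetical state labels, mirroring the states printed in the log. */
  enum State { ACTIVE, STANDBY, OBSERVER, UNREACHABLE }

  /**
   * Runs the probe on a pool thread and waits at most timeoutMs for it.
   * A probe that times out or throws is reported as UNREACHABLE so the
   * caller can move on to the next NameNode instead of blocking.
   */
  static State probeWithTimeout(ExecutorService pool, Callable<State> probe,
      long timeoutMs) {
    Future<State> task = pool.submit(probe);
    try {
      return task.get(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      task.cancel(true);  // interrupt the stuck probe thread
      return State.UNREACHABLE;
    } catch (ExecutionException e) {
      return State.UNREACHABLE;  // the probe itself failed
    } catch (InterruptedException e) {
      task.cancel(true);
      Thread.currentThread().interrupt();  // preserve the interrupt flag
      return State.UNREACHABLE;
    }
  }

  public static void main(String[] args) {
    ExecutorService pool = Executors.newCachedThreadPool();
    try {
      // A responsive NN answers well within the timeout.
      State healthy = probeWithTimeout(pool, () -> State.OBSERVER, 1000);
      // An unresponsive NN is simulated by a probe that sleeps for a minute.
      State hung = probeWithTimeout(pool, () -> {
        Thread.sleep(60_000);
        return State.STANDBY;
      }, 200);
      System.out.println(healthy + " " + hung);
    } finally {
      pool.shutdownNow();
    }
  }
}
```

Run as-is, `main` prints `OBSERVER UNREACHABLE`: the simulated hung probe is abandoned after 200 ms rather than holding the caller for the full sleep, which is the same trade-off as the 5-second wait versus the 2-minute RPC timeout in the test above.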