Xing Lin created HDFS-17030:
-------------------------------
Summary: Limit wait time for getHAServiceState in ObserverReaderProxy
Key: HDFS-17030
URL: https://issues.apache.org/jira/browse/HDFS-17030
Project: Hadoop HDFS
Issue Type: Improvement
Components: hdfs
Affects Versions: 3.4.0
Reporter: Xing Lin
When HA is enabled and a standby NN is not responsive (either because it is
down or because a heap dump is being taken), we wait for either
_socket_connection_timeout * socket_max_retries_on_connection_timeout_ or
_rpcTimeOut_ before moving on to the next NN. This adds significant
latency. For clusters at LinkedIn, we set rpcTimeOut to 120 seconds, so a
request takes more than 2 minutes to complete when we take a heap
dump at a standby. This has been causing user job failures.
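To make the two wait bounds concrete, here is a small arithmetic sketch. The 120-second rpcTimeOut is the value quoted above; the 20-second connection timeout and 45 retries are illustrative assumptions (they are not stated in this issue), and the class and method names are hypothetical.

```java
public class WorstCaseWaitSketch {
    // Wait when the NN host is down: each connection attempt times out,
    // and the client retries up to maxRetries times.
    static long connectWaitMs(long connectTimeoutMs, int maxRetries) {
        return connectTimeoutMs * maxRetries;
    }

    public static void main(String[] args) {
        long rpcTimeoutMs = 120_000L;     // value quoted in this issue
        long connectTimeoutMs = 20_000L;  // assumed, for illustration only
        int maxRetries = 45;              // assumed, for illustration only

        System.out.println("NN hung (e.g. heap dump), bounded by rpcTimeOut: "
            + rpcTimeoutMs / 1000 + " s");
        System.out.println("NN down, bounded by connection retries: "
            + connectWaitMs(connectTimeoutMs, maxRetries) / 1000 + " s");
    }
}
```

Either bound is paid per unresponsive NN before the client tries the next one.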
The proposal is to add a timeout on getHAServiceState() calls in
ObserverReaderProxy, so that we wait at most that long for an NN to respond
with its HA state. Once that timeout passes, we move on to the next NN.
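A minimal sketch of the idea, not the actual patch: the probe runs on an executor and the caller waits on the Future with a bound. The NameNodeProbe interface, the HAState enum, and the method names are hypothetical stand-ins for the real proxy types, and the 200 ms timeout is arbitrary.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class HAStateProbeSketch {
    enum HAState { ACTIVE, STANDBY, OBSERVER }

    /** Hypothetical stand-in for a proxy to a single NN. */
    interface NameNodeProbe {
        HAState getHAServiceState() throws Exception;
    }

    /** Returns the NN's state, or null if it fails or does not answer in time. */
    static HAState probeWithTimeout(ExecutorService pool, NameNodeProbe nn, long timeoutMs) {
        Callable<HAState> task = nn::getHAServiceState;
        Future<HAState> future = pool.submit(task);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true);  // stop waiting on this NN
            return null;
        } catch (ExecutionException e) {
            return null;          // the probe itself failed
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return null;
        }
    }

    /** Index of the first NN reporting the wanted state, or -1 if none does. */
    static int firstWithState(ExecutorService pool, List<NameNodeProbe> nns,
                              HAState wanted, long timeoutMs) {
        for (int i = 0; i < nns.size(); i++) {
            if (probeWithTimeout(pool, nns.get(i), timeoutMs) == wanted) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newCachedThreadPool();
        try {
            // Simulates a standby that is busy with a heap dump.
            NameNodeProbe hung = () -> { Thread.sleep(5_000); return HAState.STANDBY; };
            NameNodeProbe observer = () -> HAState.OBSERVER;
            int idx = firstWithState(pool, Arrays.asList(hung, observer),
                                     HAState.OBSERVER, 200);
            System.out.println("observer index = " + idx);
        } finally {
            pool.shutdownNow();
        }
    }
}
```

With this shape, the cost of an unresponsive NN is capped at the probe timeout rather than at rpcTimeOut.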
--
This message was sent by Atlassian Jira
(v8.20.10#820010)