[jira] [Commented] (HDFS-17030) Limit wait time for getHAServiceState in ObserverReaderProxy

ASF GitHub Bot (Jira) Thu, 01 Jun 2023 10:27:51 -0700


    [ 
https://issues.apache.org/jira/browse/HDFS-17030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17728437#comment-17728437
 ]


ASF GitHub Bot commented on HDFS-17030:
---------------------------------------

xinglin commented on PR #5700:
URL: https://github.com/apache/hadoop/pull/5700#issuecomment-1572492538

   The `hadoop.hdfs.server.namenode.ha.TestObserverNode` failure is a bit 
concerning, but I ran the same test on the trunk branch and it also failed in 
2 out of 8 runs. That test is about the stateID in AlignmentContext from 
observerReadProxy, which shouldn't be impacted by my change in this PR.
   
   ```
   mvn test -Dtest="TestObserverNode#testMkdirsRaceWithObserverRead" >> testObserverNode.log
   xinglin@xinglin-mn1 ~/p/h/t/h/hadoop-hdfs (trunk)> grep "Tests run" testObserverNode.log
   [INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 14.893 s - in org.apache.hadoop.hdfs.server.namenode.ha.TestObserverNode
   [INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0
   [INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 14.416 s - in org.apache.hadoop.hdfs.server.namenode.ha.TestObserverNode
   [INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0
   [ERROR] Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 16.565 s <<< FAILURE! - in org.apache.hadoop.hdfs.server.namenode.ha.TestObserverNode
   [ERROR] Tests run: 1, Failures: 1, Errors: 0, Skipped: 0
   [ERROR] Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 14.886 s <<< FAILURE! - in org.apache.hadoop.hdfs.server.namenode.ha.TestObserverNode
   [ERROR] Tests run: 1, Failures: 1, Errors: 0, Skipped: 0
   [INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 15.432 s - in org.apache.hadoop.hdfs.server.namenode.ha.TestObserverNode
   [INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0
   [INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 15.905 s - in org.apache.hadoop.hdfs.server.namenode.ha.TestObserverNode
   [INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0
   [INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 13.871 s - in org.apache.hadoop.hdfs.server.namenode.ha.TestObserverNode
   [INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0
   [INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 14.389 s - in org.apache.hadoop.hdfs.server.namenode.ha.TestObserverNode
   [INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0
   ```




> Limit wait time for getHAServiceState in ObserverReaderProxy
> ------------------------------------------------------------
>
>                 Key: HDFS-17030
>                 URL: https://issues.apache.org/jira/browse/HDFS-17030
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: hdfs
>    Affects Versions: 3.4.0
>            Reporter: Xing Lin
>            Assignee: Xing Lin
>            Priority: Minor
>              Labels: pull-request-available
>
> When namenode HA is enabled and a standby NN is not responsive, we have 
> observed that it can take a long time to serve a request, even though we have 
> a healthy observer or active NN. 
> Basically, when a standby is down, the RPC client would (re)try to create a 
> socket connection to that standby for _ipc.client.connect.timeout_ * 
> _ipc.client.connect.max.retries.on.timeouts_ before giving up. When we take a 
> heap dump at a standby, the NN still accepts the socket connection but it 
> won't send responses to these RPC requests, and we would time out after 
> _ipc.client.rpc-timeout.ms_. This adds significant latency. For clusters 
> at LinkedIn, we set _ipc.client.rpc-timeout.ms_ to 120 seconds and thus a 
> request takes more than 2 mins to complete when we take a heap dump at a 
> standby. This has been causing user job failures. 
> We could set _ipc.client.rpc-timeout.ms_ to a smaller value when sending 
> getHAServiceState requests in ObserverReaderProxy (for user rpc requests, we 
> still use the original value from the config). However, that would double the 
> number of socket connections between clients and the NN (which is a 
> deal-breaker). 
> The proposal is to add a timeout on getHAServiceState() calls in 
> ObserverReaderProxy so that we only wait up to that timeout for an NN to 
> respond with its HA state. Once we pass that timeout, we move on to probe the 
> next NN. 
>  
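
The proposal in the description above can be illustrated with a minimal, standalone sketch. This is not the code from the PR: `HaStateProbeSketch`, `HaState`, `probeWithTimeout` and `findObserver` are hypothetical names, and each NN probe is modelled as a plain `Callable` rather than a real getHAServiceState() RPC. It only shows the general idea of bounding each probe with `Future.get(timeout)` and moving on to the next NN when the wait expires.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class HaStateProbeSketch {
  /** Hypothetical stand-in for the HA states an NN can report. */
  enum HaState { ACTIVE, OBSERVER, STANDBY }

  /**
   * Submits one probe and waits at most timeoutMs for its answer.
   * Returns null if the probe times out or fails, so the caller can move on.
   */
  static HaState probeWithTimeout(ExecutorService pool, Callable<HaState> probe,
                                  long timeoutMs) {
    Future<HaState> future = pool.submit(probe);
    try {
      return future.get(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      future.cancel(true); // interrupt the stuck probe and give up on this NN
      return null;
    } catch (ExecutionException e) {
      return null; // the probe itself threw; treat the NN as unavailable
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the caller's interrupt status
      future.cancel(true);
      return null;
    }
  }

  /** Returns the index of the first NN that reports OBSERVER in time, or -1. */
  static int findObserver(List<Callable<HaState>> probes, long timeoutMs) {
    // Assumption: a cached pool, so a hung probe does not block the next one.
    ExecutorService pool = Executors.newCachedThreadPool();
    try {
      for (int i = 0; i < probes.size(); i++) {
        if (probeWithTimeout(pool, probes.get(i), timeoutMs) == HaState.OBSERVER) {
          return i;
        }
      }
      return -1;
    } finally {
      pool.shutdownNow(); // interrupt any probe that is still hanging
    }
  }

  public static void main(String[] args) {
    // Simulates a standby that accepts the call but never answers in time.
    Callable<HaState> hungStandby = () -> {
      Thread.sleep(60_000);
      return HaState.STANDBY;
    };
    Callable<HaState> observer = () -> HaState.OBSERVER;
    int idx = findObserver(Arrays.asList(hungStandby, observer), 200);
    System.out.println("observer index: " + idx);
  }
}
```

In this sketch the wait on the hung standby is capped at the probe timeout (200 ms here, an arbitrary value) instead of the full _ipc.client.rpc-timeout.ms_, and no second connection with a different RPC timeout is needed.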



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
