[
https://issues.apache.org/jira/browse/HDFS-17030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17728171#comment-17728171
]
ASF GitHub Bot commented on HDFS-17030:
---------------------------------------
xinglin commented on code in PR #5700:
URL: https://github.com/apache/hadoop/pull/5700#discussion_r1212424086
##########
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestObserverReadProxyProvider.java:
##########
@@ -325,6 +357,94 @@ public void testObserverRetriableException() throws Exception {
assertHandledBy(1);
}
+ /**
+ * Happy case for GetHAServiceStateWithTimeout.
+ */
+ @Test
+ public void testGetHAServiceStateWithTimeout() throws Exception {
+ setupProxyProvider(1);
+ final HAServiceState state = HAServiceState.STANDBY;
+ NNProxyInfo<ClientProtocol> dummyNNProxyInfo =
+ (NNProxyInfo<ClientProtocol>) mock(NNProxyInfo.class);
+ Future<HAServiceState> task = mock(Future.class);
+ when(task.get(anyLong(), any(TimeUnit.class))).thenReturn(state);
+
+ HAServiceState state2 =
+ proxyProvider.getHAServiceStateWithTimeout(dummyNNProxyInfo, task);
+ assertEquals(state, state2);
+ verify(task).get(anyLong(), any(TimeUnit.class));
+ verifyNoMoreInteractions(task);
+ verify(logger).debug(startsWith("HA State for"));
+ }
+
+ /**
+ * Test TimeoutException for GetHAServiceStateWithTimeout.
+ */
+ @Test
+ public void testTimeoutExceptionGetHAServiceStateWithTimeout()
+ throws Exception {
+ setupProxyProvider(1);
+ NNProxyInfo<ClientProtocol> dummyNNProxyInfo =
+ (NNProxyInfo<ClientProtocol>) Mockito.mock(NNProxyInfo.class);
+ Future<HAServiceState> task = mock(Future.class);
+ when(task.get(anyLong(), any(TimeUnit.class))).thenThrow(
+ new TimeoutException("Timeout"));
+
+ HAServiceState state =
+ proxyProvider.getHAServiceStateWithTimeout(dummyNNProxyInfo, task);
+ assertNull(state);
+ verify(task).get(anyLong(), any(TimeUnit.class));
+ verify(task).cancel(true);
+ verifyNoMoreInteractions(task);
+ verify(logger).debug(startsWith("Cancel NN probe task due to timeout for"));
+ }
+
+ /**
+ * Test InterruptedException for GetHAServiceStateWithTimeout.
+ * Tests for the other two exceptions are the same and thus left out.
+ */
+ @Test
+ public void testInterruptedExceptionGetHAServiceStateWithTimeout()
+ throws Exception {
+ setupProxyProvider(1);
+ NNProxyInfo<ClientProtocol> dummyNNProxyInfo =
+ (NNProxyInfo<ClientProtocol>) Mockito.mock(NNProxyInfo.class);
+ Future<HAServiceState> task = mock(Future.class);
+ when(task.get(anyLong(), any(TimeUnit.class))).thenThrow(
+ new InterruptedException("Interrupted"));
+
+ HAServiceState state =
+ proxyProvider.getHAServiceStateWithTimeout(dummyNNProxyInfo, task);
+ assertNull(state);
+ verify(task).get(anyLong(), any(TimeUnit.class));
+ verifyNoMoreInteractions(task);
+ verify(logger).debug(
+ startsWith("Interrupted exception in NN probe task for"));
+ }
+
+ /**
+ * Test InterruptedException for GetHAServiceStateWithTimeout.
+ * Tests for the other two exceptions are the same and thus left out.
Review Comment:
Outdated comments; removed.
> Limit wait time for getHAServiceState in ObserverReaderProxy
> ------------------------------------------------------------
>
> Key: HDFS-17030
> URL: https://issues.apache.org/jira/browse/HDFS-17030
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: hdfs
> Affects Versions: 3.4.0
> Reporter: Xing Lin
> Assignee: Xing Lin
> Priority: Minor
> Labels: pull-request-available
>
> When NameNode HA is enabled and a standby NN is not responsive, we have
> observed that it can take a long time to serve a request, even though a
> healthy observer or active NN is available.
> Basically, when a standby is down, the RPC client (re)tries to create a
> socket connection to that standby for _ipc.client.connect.timeout_ *
> _ipc.client.connect.max.retries.on.timeouts_ before giving up. When we take a
> heap dump at a standby, the NN still accepts the socket connection but
> won't send responses to these RPC requests, so we time out after
> _ipc.client.rpc-timeout.ms_. This adds significant latency. For clusters
> at LinkedIn, we set _ipc.client.rpc-timeout.ms_ to 120 seconds, and thus a
> request takes more than 2 minutes to complete when we take a heap dump at a
> standby. This has been causing user job failures.
> We could set _ipc.client.rpc-timeout.ms_ to a smaller value when sending
> getHAServiceState requests in ObserverReaderProxy (for user RPC requests, we
> would still use the original value from the config). However, that would
> double the number of socket connections between clients and the NN (which is
> a deal-breaker).
> The proposal is to add a timeout on getHAServiceState() calls in
> ObserverReaderProxy, so that we only wait up to that timeout for an NN to
> respond with its HA state. Once that timeout passes, we move on to probe the
> next NN.
>
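The proposal above (wait a bounded time for each NN's HA state, then move on) can be illustrated with a minimal, self-contained sketch. This is not the code from the PR: the class name HaStateProbeSketch, the local State enum, and the method getStateWithTimeout are hypothetical stand-ins that use only java.util.concurrent. Cancelling the task on timeout mirrors what the tests in the diff verify; restoring the interrupt flag and returning null for the other exceptions are assumptions.

```java
import java.util.concurrent.CancellationException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class HaStateProbeSketch {
  // Hypothetical stand-in for Hadoop's HAServiceState.
  enum State { ACTIVE, STANDBY, OBSERVER }

  // Wait at most timeoutMs for the probe; return null if no state was obtained.
  static State getStateWithTimeout(Future<State> task, long timeoutMs) {
    try {
      return task.get(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      // The NN did not answer in time: cancel the probe so the caller can move on.
      task.cancel(true);
      return null;
    } catch (InterruptedException e) {
      // Assumption: restore the interrupt flag for the caller.
      Thread.currentThread().interrupt();
      return null;
    } catch (ExecutionException | CancellationException e) {
      // Assumption: any other probe failure is treated as "state unknown".
      return null;
    }
  }

  public static void main(String[] args) {
    ExecutorService pool = Executors.newCachedThreadPool();
    try {
      // Simulates a standby that accepts the connection but never responds.
      Future<State> hung = pool.submit(() -> {
        Thread.sleep(60_000);
        return State.STANDBY;
      });
      // Simulates a healthy observer that answers immediately.
      Future<State> healthy = pool.submit(() -> State.OBSERVER);

      System.out.println("hung NN:    " + getStateWithTimeout(hung, 100));
      System.out.println("healthy NN: " + getStateWithTimeout(healthy, 1_000));
    } finally {
      pool.shutdownNow();
    }
  }
}
```

With this shape, a hung NN costs the client only the probe timeout rather than the full _ipc.client.rpc-timeout.ms_, and no second connection per NN is needed because the timeout is applied around the existing call.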