[ https://issues.apache.org/jira/browse/HADOOP-10127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13834463#comment-13834463 ]
Karthik Kambatla commented on HADOOP-10127:
-------------------------------------------
Thanks [~stevel] for clarifying the potential issues that arise from retrying
more frequently.
The context for this is indeed YARN-1028 - ConfiguredFailoverProxy for RM
failover. In an HA setting where the second RM is the active one, with the
current default for ipc.client.connect.max.retries (10), Clients / AMs / NMs
retry the first RM for 10 seconds before trying the second RM. This leads to a
significant performance hit. The failover delay can be mitigated by setting
ipc.client.connect.max.retries to 1, but I thought there might be merit in
connecting to the same RM multiple times (> 1) before trying the other one.
Hence the proposal to make the retry interval configurable, so it can be
shortened: try connecting to the same RM twice, half a second apart, before
failing over.
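To make the arithmetic above concrete, here is a minimal sketch. It assumes the
time spent on one RM is simply (number of retries) x (retry interval) and
ignores connect timeouts and the time each attempt itself takes. The class and
method names are hypothetical and are not part of {{ipc.Client}}.

```java
public class FailoverDelaySketch {

    /**
     * Simplified model (an assumption, not the real ipc.Client logic):
     * the client sleeps retryIntervalMillis after each failed attempt,
     * so the time spent on one RM is retries * interval.
     */
    public static long sleepBeforeFailoverMillis(int maxRetries, long retryIntervalMillis) {
        if (maxRetries < 0 || retryIntervalMillis < 0) {
            throw new IllegalArgumentException("retries and interval must be non-negative");
        }
        return (long) maxRetries * retryIntervalMillis;
    }

    public static void main(String[] args) {
        // Current behaviour described above: 10 retries, 1 second apart.
        System.out.println("default:  " + sleepBeforeFailoverMillis(10, 1000) + " ms");
        // Workaround: a single retry at the fixed 1 second interval.
        System.out.println("1 retry:  " + sleepBeforeFailoverMillis(1, 1000) + " ms");
        // Proposal: two retries, half a second apart.
        System.out.println("proposal: " + sleepBeforeFailoverMillis(2, 500) + " ms");
    }
}
```

Under this model the proposal costs about the same wall-clock time as the
single-retry workaround, but gives the first RM two chances instead of one.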
bq. If it really is NM->RM calls you are worried about, then perhaps rather
than make changes to the general IPC client, this is a good time to impose a
better retry policy here, where exponential backoff with jitter is what I'd
propose.
Even if we improve the retry policy in {Client|Server}*RMProxy, the
{{ipc.Client}} delay of 10 seconds before failing over still exists. What do
you think of making the general Client dumb enough to try connecting only once
and letting the higher layers take care of the actual retry policies? I know
that would be a significant change, but might it be worth making?
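For reference, a higher-layer policy of the kind suggested in the quote
(exponential backoff with jitter) could look roughly like the sketch below. It
is a generic illustration using the "full jitter" variant, where each delay is
drawn uniformly between 0 and an exponentially growing, capped ceiling. The
class name, method names, and the base/cap values are assumptions for
illustration only, not existing Hadoop or YARN code.

```java
import java.util.Random;

public class BackoffWithJitterSketch {

    /** Ceiling for a given attempt: base * 2^attempt, capped at capMillis. */
    public static long backoffCeilingMillis(long baseMillis, long capMillis, int attempt) {
        long ceiling = baseMillis;
        // Double once per prior attempt, stopping early once the cap is reached.
        for (int i = 0; i < attempt && ceiling < capMillis; i++) {
            ceiling *= 2;
        }
        return Math.min(ceiling, capMillis);
    }

    /** Full jitter: a delay chosen uniformly from [0, ceiling]. */
    public static long nextDelayMillis(long baseMillis, long capMillis, int attempt, Random random) {
        long ceiling = backoffCeilingMillis(baseMillis, capMillis, attempt);
        return (long) (random.nextDouble() * (ceiling + 1));
    }

    public static void main(String[] args) {
        Random random = new Random();
        // Hypothetical values: start at 500 ms, never wait longer than 8 s.
        for (int attempt = 0; attempt < 6; attempt++) {
            long ceiling = backoffCeilingMillis(500, 8000, attempt);
            long delay = nextDelayMillis(500, 8000, attempt, random);
            System.out.println("attempt " + attempt + ": ceiling=" + ceiling
                    + " ms, delay=" + delay + " ms");
        }
    }
}
```

The jitter spreads out retries from many NMs / AMs so that they do not all hit
the RM at the same instant after a failover.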
> Add ipc.client.connect.retry.interval to control the frequency of connection
> retries
> ------------------------------------------------------------------------------------
>
> Key: HADOOP-10127
> URL: https://issues.apache.org/jira/browse/HADOOP-10127
> Project: Hadoop Common
> Issue Type: Bug
> Components: ipc
> Affects Versions: 2.2.0
> Reporter: Karthik Kambatla
> Assignee: Karthik Kambatla
> Attachments: hadoop-10127-1.patch
>
>
> Currently, {{ipc.Client}} attempts to connect to the server every 1
> second. It would be nice to make this configurable to be able to connect
> more/less frequently. Changing the number of retries alone is not granular
> enough.