[ 
https://issues.apache.org/jira/browse/HADOOP-10127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13833654#comment-13833654
 ] 

Steve Loughran commented on HADOOP-10127:
-----------------------------------------

Which clients are you thinking of here? 
What we need to avoid is overload on any failover/restart operations in very 
large clusters, where the scenarios are
# a master service fails, failover begins, and all the worker nodes in the 
cluster generate large numbers of connect requests to the successor. 
# a cluster power-restart event, where all restarted nodes/clients start hitting 
the booting services in near-perfect sync. I've seen this with flash-based 
devices where the boot time is constant across all nodes -it's why jitter is 
important, and why clock-based & time-since-boot jitter needs an extra bit of 
randomness
# a server taken offline explicitly while heavy client load comes in from 
outside. The more clients that block retrying connection requests, the more 
pending calls build up, so the server ends up receiving a massive multiple of 
the normal working load the moment it goes live.
# more than one of the above at once. This is what led to the infamous 
Facebook HDFS cascade failure -and hence why NN heartbeats now come in on a 
different RPC port from DFS client operations.

Shrink the retry interval and the load generated against starting/failing-over 
endpoints can increase massively. That doesn't mean it shouldn't be allowed 
-just that you need to understand that special problems arise at a few thousand 
servers and plan for them.

If it really is NM->RM calls you are worried about, then perhaps, rather than 
making changes to the general IPC client, this is a good time to impose a 
better retry policy there; exponential backoff with jitter is what I'd propose. 
The initial delay could be small, but it would back off fast if the cluster 
were down for any length of time.
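As a rough illustration of what such a policy could look like (a standalone sketch, not Hadoop's actual RetryPolicy API; the class name, base delay, and cap are all made up for the example), this uses "full jitter": each retry sleeps a uniformly random time between zero and an exponentially growing, capped ceiling, so a fleet of clients never retries in lockstep.

```java
import java.util.concurrent.ThreadLocalRandom;

/**
 * Hypothetical sketch of exponential backoff with full jitter.
 * Not the Hadoop IPC retry implementation; constants are illustrative.
 */
public class BackoffSketch {
    static final long BASE_DELAY_MS = 100;    // small initial delay
    static final long MAX_DELAY_MS = 30_000;  // cap so retries never stall for minutes

    /** Delay before retry N: uniform in [0, min(MAX, BASE * 2^attempt)]. */
    static long delayMs(int attempt) {
        // Clamp the shift so BASE << attempt cannot overflow for large attempt counts.
        long ceiling = Math.min(MAX_DELAY_MS, BASE_DELAY_MS << Math.min(attempt, 20));
        return ThreadLocalRandom.current().nextLong(ceiling + 1);
    }

    public static void main(String[] args) {
        // Show how the backoff ceiling grows across attempts while the
        // actual sleep stays randomized within it.
        for (int attempt = 0; attempt < 10; attempt++) {
            long cap = Math.min(MAX_DELAY_MS, BASE_DELAY_MS << Math.min(attempt, 20));
            System.out.printf("attempt %d: sleep %d ms (ceiling %d ms)%n",
                    attempt, delayMs(attempt), cap);
        }
    }
}
```

The jitter matters for exactly the synchronized-boot scenario above: with a deterministic backoff, every client that failed at the same instant retries at the same instant, and the thundering herd just repeats on a longer period.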

> Add ipc.client.connect.retry.interval to control the frequency of connection 
> retries
> ------------------------------------------------------------------------------------
>
>                 Key: HADOOP-10127
>                 URL: https://issues.apache.org/jira/browse/HADOOP-10127
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: ipc
>    Affects Versions: 2.2.0
>            Reporter: Karthik Kambatla
>            Assignee: Karthik Kambatla
>         Attachments: hadoop-10127-1.patch
>
>
> Currently, {{ipc.Client}} client attempts to connect to the server every 1 
> second. It would be nice to make this configurable to be able to connect 
> more/less frequently. Changing the number of retries alone is not granular 
> enough.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
