[jira] [Commented] (HDDS-4754) A restarted SCM quickly go OOM due to ContainerReport Storm from DN cluster.

Yiqun Lin (Jira) Thu, 28 Jan 2021 06:44:06 -0800


    [ 
https://issues.apache.org/jira/browse/HDDS-4754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17273774#comment-17273774
 ]


Yiqun Lin commented on HDDS-4754:
---------------------------------

Good catch, [~yjxxtd]!

I see that the DN heartbeat interval (hdds.heartbeat.interval) is currently 
30s, so can we also use HddsConfigKeys#HDDS_HEARTBEAT_INTERVAL_DEFAULT as the 
retry interval here?
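
To put rough numbers on the difference, here is a minimal sketch. It is not 
Ozone code: the class and method names are hypothetical, the 1000-DN cluster 
size is an assumed figure, and it assumes the DNs' retry timers are spread 
evenly across one retry interval. Under those assumptions, the rate at which 
DNs reconnect once the SCM is back is roughly the DN count divided by the 
retry interval.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical back-of-the-envelope helper, not part of Ozone.
class RetryIntervalSketch {

    // Average reconnect attempts per second reaching the SCM, assuming every
    // DN retries once per interval and the retry timers are evenly spread.
    static double attemptsPerSecond(int datanodes, long retryInterval, TimeUnit unit) {
        double intervalSeconds = unit.toMillis(retryInterval) / 1000.0;
        return datanodes / intervalSeconds;
    }

    public static void main(String[] args) {
        int datanodes = 1000; // assumed cluster size, not from the issue

        // Current policy from the issue description: fixed 1000 ms sleep.
        double current = attemptsPerSecond(datanodes, 1000, TimeUnit.MILLISECONDS);

        // Suggested policy: reuse the 30s heartbeat interval as the retry sleep.
        double suggested = attemptsPerSecond(datanodes, 30, TimeUnit.SECONDS);

        System.out.printf("1s retry:  %.1f attempts/s%n", current);
        System.out.printf("30s retry: %.1f attempts/s%n", suggested);
    }
}
```

With these assumed inputs, the 30s interval spreads the same reconnects over 
a window 30 times longer, which is the effect the suggestion is after.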

> A restarted SCM quickly go OOM due to ContainerReport Storm from DN cluster.
> ----------------------------------------------------------------------------
>
>                 Key: HDDS-4754
>                 URL: https://issues.apache.org/jira/browse/HDDS-4754
>             Project: Hadoop Distributed Data Store
>          Issue Type: Improvement
>            Reporter: runzhiwang
>            Priority: Major
>         Attachments: 企业微信截图_1611734015772.png
>
>
> During Tencent's monthly upgrade, we restarted all DNs first, then stopped 
> the SCM, waited for a while, and started it again. The SCM went OOM within a 
> short time.
> The DN's current retry policy is to retry sending at a fixed 1s interval. 
> When the SCM restarts, all DNs lose their connection to it at the same time, 
> so once it is back they all send their container reports at nearly the same 
> time, which causes a ContainerReport storm.
> We propose changing the retry policy that the datanode uses to connect to 
> the SCM.
> {code:java}
> public void addSCMServer(InetSocketAddress address) throws IOException {
>   writeLock();
>   try {
>     if (scmMachines.containsKey(address)) {
>       LOG.warn("Trying to add an existing SCM Machine to Machines group. " +
>           "Ignoring the request.");
>       return;
>     }
>     Configuration hadoopConfig =
>         LegacyHadoopConfigurationSource.asHadoopConfiguration(this.conf);
>     RPC.setProtocolEngine(
>         hadoopConfig,
>         StorageContainerDatanodeProtocolPB.class,
>         ProtobufRpcEngine.class);
>     long version =
>         RPC.getProtocolVersion(StorageContainerDatanodeProtocolPB.class);
>     RetryPolicy retryPolicy =
>         RetryPolicies.retryUpToMaximumCountWithFixedSleep(
>             getScmRpcRetryCount(conf),
>             1000, TimeUnit.MILLISECONDS);
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

