[jira] [Updated] (HDDS-4754) A restarted SCM quickly OOM due to ContainerReport Storm from DN cluster.

Glen Geng (Jira) Wed, 27 Jan 2021 00:24:04 -0800


     [ https://issues.apache.org/jira/browse/HDDS-4754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Glen Geng updated HDDS-4754:
----------------------------
    Description: 
 

During our upgrade, we restart all DNs first, then stop the SCM, wait for a 
while, and start it again.

Current retry policy is 

If at some point all the DNs lose their connection to the SCM at the same 
time, they will 

 

We propose changing the datanode retry policy for connecting to the SCM.
{code:java}
public void addSCMServer(InetSocketAddress address) throws IOException {
  writeLock();
  try {
    if (scmMachines.containsKey(address)) {
      LOG.warn("Trying to add an existing SCM Machine to Machines group. " +
          "Ignoring the request.");
      return;
    }

    Configuration hadoopConfig =
        LegacyHadoopConfigurationSource.asHadoopConfiguration(this.conf);
    RPC.setProtocolEngine(
        hadoopConfig,
        StorageContainerDatanodeProtocolPB.class,
        ProtobufRpcEngine.class);
    long version =
        RPC.getProtocolVersion(StorageContainerDatanodeProtocolPB.class);

    RetryPolicy retryPolicy =
        RetryPolicies.retryUpToMaximumCountWithFixedSleep(
            getScmRpcRetryCount(conf),
            1000, TimeUnit.MILLISECONDS);
{code}
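
The snippet above retries with a fixed 1000 ms sleep, so DNs that lose the SCM at the same moment also retry at the same moments. The description does not say which replacement policy is intended. As a minimal sketch of one common option, the class below computes an exponentially growing, randomly jittered delay so that reconnect attempts spread out over time. `JitteredBackoff`, `baseMillis`, and `capMillis` are hypothetical names used only for illustration; they are not Ozone or Hadoop APIs, and the 60 s cap is an arbitrary assumption.

```java
import java.util.concurrent.ThreadLocalRandom;

/**
 * Hypothetical helper (not part of Ozone or Hadoop): exponential backoff
 * with full jitter, shown only to illustrate how retries can be spread out.
 */
final class JitteredBackoff {
  private final long baseMillis; // delay ceiling for the first retry
  private final long capMillis;  // upper limit for any delay ceiling

  JitteredBackoff(long baseMillis, long capMillis) {
    if (baseMillis <= 0 || capMillis < baseMillis
        || capMillis == Long.MAX_VALUE) {
      throw new IllegalArgumentException("need 0 < base <= cap < Long.MAX_VALUE");
    }
    this.baseMillis = baseMillis;
    this.capMillis = capMillis;
  }

  /** Largest sleep allowed before retry number {@code attempt} (0-based). */
  long ceilingMillis(int attempt) {
    int shift = Math.min(Math.max(attempt, 0), 62);
    // Compare before shifting so that base * 2^shift can never overflow.
    if (baseMillis > (capMillis >> shift)) {
      return capMillis;
    }
    return baseMillis << shift;
  }

  /** Random sleep in [0, ceilingMillis(attempt)], different on each node. */
  long sleepMillis(int attempt) {
    return ThreadLocalRandom.current().nextLong(ceilingMillis(attempt) + 1);
  }

  public static void main(String[] args) {
    // Assumed values: 1 s base (as in the snippet above), 60 s cap.
    JitteredBackoff backoff = new JitteredBackoff(1000, 60_000);
    for (int attempt = 0; attempt < 8; attempt++) {
      System.out.printf("attempt %d: ceiling=%d ms, sleep=%d ms%n",
          attempt, backoff.ceilingMillis(attempt), backoff.sleepMillis(attempt));
    }
  }
}
```

Because each DN draws its own random sleep, a cluster that loses the SCM simultaneously no longer reconnects in lockstep. Whether this or another policy fits the actual fix is left open here.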

  was:
We propose to change datanode retry policy to connect SCM.

 

 


> A restarted SCM quickly OOM due to ContainerReport Storm from DN cluster.
> -------------------------------------------------------------------------
>
>                 Key: HDDS-4754
>                 URL: https://issues.apache.org/jira/browse/HDDS-4754
>             Project: Hadoop Distributed Data Store
>          Issue Type: Improvement
>            Reporter: runzhiwang
>            Priority: Major
>         Attachments: 企业微信截图_1611734015772.png
>
>


