[ 
https://issues.apache.org/jira/browse/HDDS-4754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4754:
----------------------------
    Description: 
 

During our upgrade, we restart all DNs first, then stop the SCM, wait for a 
while, and start it again.

The current retry policy is to retry sending at a fixed 1s interval. 

If all the DNs lose their connection to the SCM at the same time because the 
SCM restarts, they will all send their container reports to the SCM at nearly 
the same time.

 

We propose changing the retry policy that datanodes use to connect to the SCM.
{code:java}
public void addSCMServer(InetSocketAddress address) throws IOException {
  writeLock();
  try {
    if (scmMachines.containsKey(address)) {
      LOG.warn("Trying to add an existing SCM Machine to Machines group. " +
          "Ignoring the request.");
      return;
    }

    Configuration hadoopConfig =
        LegacyHadoopConfigurationSource.asHadoopConfiguration(this.conf);
    RPC.setProtocolEngine(
        hadoopConfig,
        StorageContainerDatanodeProtocolPB.class,
        ProtobufRpcEngine.class);
    long version =
        RPC.getProtocolVersion(StorageContainerDatanodeProtocolPB.class);

    // Current policy: retry with a fixed 1s sleep between attempts.
    RetryPolicy retryPolicy =
        RetryPolicies.retryUpToMaximumCountWithFixedSleep(
            getScmRpcRetryCount(conf),
            1000, TimeUnit.MILLISECONDS);
{code}

  was:
 

During our upgrade, we restart all DNs first, then stop the SCM, wait for a 
while, start it.

Current retry policy is 

Given at some time-point, all the DNs lost connection with the SCM at the same 
time, they will 

 

We propose to change datanode retry policy to connect SCM.
{code:java}
public void addSCMServer(InetSocketAddress address) throws IOException {
  writeLock();
  try {
    if (scmMachines.containsKey(address)) {
      LOG.warn("Trying to add an existing SCM Machine to Machines group. " +
          "Ignoring the request.");
      return;
    }

    Configuration hadoopConfig =
        LegacyHadoopConfigurationSource.asHadoopConfiguration(this.conf);
    RPC.setProtocolEngine(
        hadoopConfig,
        StorageContainerDatanodeProtocolPB.class,
        ProtobufRpcEngine.class);
    long version =
        RPC.getProtocolVersion(StorageContainerDatanodeProtocolPB.class);

    RetryPolicy retryPolicy =
        RetryPolicies.retryUpToMaximumCountWithFixedSleep(
            getScmRpcRetryCount(conf),
            1000, TimeUnit.MILLISECONDS);
{code}


> A restarted SCM quickly OOM due to ContainerReport Storm from DN cluster.
> -------------------------------------------------------------------------
>
>                 Key: HDDS-4754
>                 URL: https://issues.apache.org/jira/browse/HDDS-4754
>             Project: Hadoop Distributed Data Store
>          Issue Type: Improvement
>            Reporter: runzhiwang
>            Priority: Major
>         Attachments: 企业微信截图_1611734015772.png
>
>
>  
> During our upgrade, we restart all DNs first, then stop the SCM, wait for a 
> while, and start it again.
> The current retry policy is to retry sending at a fixed 1s interval. 
> If all the DNs lose their connection to the SCM at the same time because the 
> SCM restarts, they will all send their container reports to the SCM at 
> nearly the same time.
>  
> We propose changing the retry policy that datanodes use to connect to the SCM.
> {code:java}
> public void addSCMServer(InetSocketAddress address) throws IOException {
>   writeLock();
>   try {
>     if (scmMachines.containsKey(address)) {
>       LOG.warn("Trying to add an existing SCM Machine to Machines group. " +
>           "Ignoring the request.");
>       return;
>     }
>     Configuration hadoopConfig =
>         LegacyHadoopConfigurationSource.asHadoopConfiguration(this.conf);
>     RPC.setProtocolEngine(
>         hadoopConfig,
>         StorageContainerDatanodeProtocolPB.class,
>         ProtobufRpcEngine.class);
>     long version =
>         RPC.getProtocolVersion(StorageContainerDatanodeProtocolPB.class);
>     // Current policy: retry with a fixed 1s sleep between attempts.
>     RetryPolicy retryPolicy =
>         RetryPolicies.retryUpToMaximumCountWithFixedSleep(
>             getScmRpcRetryCount(conf),
>             1000, TimeUnit.MILLISECONDS);
> {code}
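
The issue does not say what the replacement retry policy should be. One common way to keep many clients from retrying in lockstep is exponential backoff with random jitter, and the sketch below illustrates that idea in plain Java with no Hadoop dependency. The class name `JitteredBackoffSketch`, its methods, and the 1s base / 60s cap are all hypothetical and chosen only for illustration; this is not the change made for HDDS-4754. Hadoop's `RetryPolicies` also provides `exponentialBackoffRetry(maxRetries, sleepTime, timeUnit)`, which may be worth evaluating, though whether it suits this case is not established here.

```java
import java.util.Random;

/** Hypothetical sketch: exponential backoff with full jitter. */
public final class JitteredBackoffSketch {
  private JitteredBackoffSketch() { }

  /**
   * Upper bound of the sleep window for a 0-based attempt number:
   * baseMillis * 2^attempt, capped at maxMillis.
   */
  public static long ceilingMillis(int attempt, long baseMillis, long maxMillis) {
    // Clamp the exponent so the shift below stays well defined.
    int shift = Math.min(Math.max(attempt, 0), 30);
    // Compare before shifting so the multiplication cannot overflow.
    if (baseMillis > (maxMillis >> shift)) {
      return maxMillis;
    }
    return baseMillis << shift;
  }

  /** Picks a random delay in [0, ceiling) so that clients spread out over the window. */
  public static long delayMillis(int attempt, long baseMillis, long maxMillis, Random rng) {
    long ceiling = ceilingMillis(attempt, baseMillis, maxMillis);
    // nextDouble() is in [0.0, 1.0), so the result is strictly below ceiling.
    return (long) (rng.nextDouble() * ceiling);
  }

  public static void main(String[] args) {
    Random rng = new Random();
    // Assumed values for illustration: 1s base interval, 60s cap.
    for (int attempt = 0; attempt < 8; attempt++) {
      System.out.println("attempt " + attempt
          + ": window up to " + ceilingMillis(attempt, 1000L, 60000L) + " ms"
          + ", chosen delay " + delayMillis(attempt, 1000L, 60000L, rng) + " ms");
    }
  }
}
```

With a fixed 1s sleep, every DN that lost its connection at the same moment retries at the same moment. Drawing each delay at random from a growing window spreads those retries out, at the cost of a longer average reconnect time.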


