[
https://issues.apache.org/jira/browse/HDDS-4754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Glen Geng updated HDDS-4754:
----------------------------
Description:
During our upgrade, we restart all DNs first, then stop the SCM, wait for a
while, and start it again.
The current retry policy is to resend with a fixed 1s interval.
If all the DNs lose their connection to the SCM at the same moment because the
SCM restarts, they will all send their container reports to the SCM at nearly
the same time once it is back.
We propose changing the retry policy that datanodes use to connect to the SCM.
{code:java}
public void addSCMServer(InetSocketAddress address) throws IOException {
  writeLock();
  try {
    if (scmMachines.containsKey(address)) {
      LOG.warn("Trying to add an existing SCM Machine to Machines group. " +
          "Ignoring the request.");
      return;
    }
    Configuration hadoopConfig =
        LegacyHadoopConfigurationSource.asHadoopConfiguration(this.conf);
    RPC.setProtocolEngine(
        hadoopConfig,
        StorageContainerDatanodeProtocolPB.class,
        ProtobufRpcEngine.class);
    long version =
        RPC.getProtocolVersion(StorageContainerDatanodeProtocolPB.class);
    RetryPolicy retryPolicy =
        RetryPolicies.retryUpToMaximumCountWithFixedSleep(
            getScmRpcRetryCount(conf),
            1000, TimeUnit.MILLISECONDS);
{code}
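The issue does not say which policy should replace the fixed 1s sleep. As one possible direction, the sketch below contrasts the fixed delay with exponential backoff plus random jitter, so that datanodes that lost the SCM at the same moment do not all retry in the same second. The class `BackoffSketch`, its method names, and the 30s cap are hypothetical and are not part of Ozone or Hadoop. Hadoop's `RetryPolicies` also provides `exponentialBackoffRetry`, which may be worth evaluating here.

```java
import java.util.Random;

/** Hypothetical illustration only; not Ozone or Hadoop code. */
public final class BackoffSketch {
  private BackoffSketch() { }

  /** Fixed sleep: every datanode waits the same time before each retry. */
  public static long fixedDelayMs(long baseMs) {
    return baseMs;
  }

  /**
   * Exponential backoff with full jitter: the delay is drawn uniformly from
   * [0, min(capMs, baseMs * 2^attempt)], so retries from many datanodes
   * spread out instead of arriving together.
   */
  public static long jitteredDelayMs(int attempt, long baseMs, long capMs,
      Random rng) {
    // Clamp the shift so the multiplication cannot overflow for small bases.
    int shift = Math.max(0, Math.min(attempt, 20));
    long ceiling = Math.min(capMs, baseMs << shift);
    // nextDouble() is in [0, 1), so the result is in [0, ceiling].
    return (long) (rng.nextDouble() * (ceiling + 1));
  }

  public static void main(String[] args) {
    Random rng = new Random();
    // Five datanodes, all on their fourth retry (attempt index 3).
    for (int dn = 0; dn < 5; dn++) {
      System.out.println("DN" + dn
          + " fixed=" + fixedDelayMs(1000L) + "ms"
          + " jittered=" + jitteredDelayMs(3, 1000L, 30000L, rng) + "ms");
    }
  }
}
```

With the fixed policy every datanode prints the same 1000ms delay, while the jittered delays differ per datanode. Whether the SCM also needs server-side throttling of container reports is a separate question that this sketch does not address.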
was:
During our upgrade, we restart all DNs first, then stop the SCM, wait for a
while, start it.
Current retry policy is
Given at some time-point, all the DNs lost connection with the SCM at the same
time, they will
We propose to change datanode retry policy to connect SCM.
{code:java}
public void addSCMServer(InetSocketAddress address) throws IOException {
  writeLock();
  try {
    if (scmMachines.containsKey(address)) {
      LOG.warn("Trying to add an existing SCM Machine to Machines group. " +
          "Ignoring the request.");
      return;
    }
    Configuration hadoopConfig =
        LegacyHadoopConfigurationSource.asHadoopConfiguration(this.conf);
    RPC.setProtocolEngine(
        hadoopConfig,
        StorageContainerDatanodeProtocolPB.class,
        ProtobufRpcEngine.class);
    long version =
        RPC.getProtocolVersion(StorageContainerDatanodeProtocolPB.class);
    RetryPolicy retryPolicy =
        RetryPolicies.retryUpToMaximumCountWithFixedSleep(
            getScmRpcRetryCount(conf),
            1000, TimeUnit.MILLISECONDS);
{code}
> A restarted SCM quickly OOM due to ContainerReport Storm from DN cluster.
> -------------------------------------------------------------------------
>
> Key: HDDS-4754
> URL: https://issues.apache.org/jira/browse/HDDS-4754
> Project: Hadoop Distributed Data Store
> Issue Type: Improvement
> Reporter: runzhiwang
> Priority: Major
> Attachments: 企业微信截图_1611734015772.png
>
>
>
> During our upgrade, we restart all DNs first, then stop the SCM, wait for a
> while, and start it again.
> The current retry policy is to resend with a fixed 1s interval.
> If all the DNs lose their connection to the SCM at the same moment because
> the SCM restarts, they will all send their container reports to the SCM at
> nearly the same time once it is back.
>
> We propose changing the retry policy that datanodes use to connect to the SCM.
> {code:java}
> public void addSCMServer(InetSocketAddress address) throws IOException {
>   writeLock();
>   try {
>     if (scmMachines.containsKey(address)) {
>       LOG.warn("Trying to add an existing SCM Machine to Machines group. " +
>           "Ignoring the request.");
>       return;
>     }
>     Configuration hadoopConfig =
>         LegacyHadoopConfigurationSource.asHadoopConfiguration(this.conf);
>     RPC.setProtocolEngine(
>         hadoopConfig,
>         StorageContainerDatanodeProtocolPB.class,
>         ProtobufRpcEngine.class);
>     long version =
>         RPC.getProtocolVersion(StorageContainerDatanodeProtocolPB.class);
>     RetryPolicy retryPolicy =
>         RetryPolicies.retryUpToMaximumCountWithFixedSleep(
>             getScmRpcRetryCount(conf),
>             1000, TimeUnit.MILLISECONDS);
> {code}