[
https://issues.apache.org/jira/browse/HDFS-12098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16085463#comment-16085463
]
Weiwei Yang commented on HDFS-12098:
------------------------------------
After hours of debugging I found the difference. This is not easy to reproduce
with the mini cluster, because the behavior differs between a mini cluster and
a real cluster setup. Let me explain.
*Mini Cluster*
In class {{MiniOzoneCluster}}, we instantiate SCM like this:
{code}
StorageContainerManager scm = new StorageContainerManager(conf);
if (!disableSCM) {
  // start SCM if it is not disabled.
  scm.start();
}
{code}
The SCM constructor initializes the SCM datanode and client RPC servers. During
this initialization, {{RPC.Builder(conf)...build()}} binds the RPC server to
the specified port. Once the port is bound, a subsequent client RPC call, e.g.
{code}
SCMVersionResponseProto versionResponse =
rpcEndPoint.getEndPoint().getVersion(null);
{code}
will connect to that port and try to read data. However, nothing is serving
requests on it, so the call ends in a {{SocketTimeout}}.
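To illustrate the mini cluster case outside of Hadoop, here is a minimal sketch using plain {{java.net}} sockets. It assumes the situation is analogous to a listening socket that is bound but never serviced; the class name {{BoundButSilentPort}} and the method {{probe}} are made up for this illustration and are not part of Ozone.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical demo class, not Ozone code.
public class BoundButSilentPort {

    /** Connects to a port that is bound but never served and reports what the client saw. */
    static String probe(int readTimeoutMillis) throws IOException {
        // The ServerSocket constructor binds and listens, so the OS completes
        // TCP handshakes from its backlog even though accept() is never called.
        try (ServerSocket server = new ServerSocket(0, 50, InetAddress.getLoopbackAddress());
             Socket client = new Socket()) {
            client.connect(
                new InetSocketAddress(server.getInetAddress(), server.getLocalPort()), 1000);
            client.setSoTimeout(readTimeoutMillis);
            try (InputStream in = client.getInputStream()) {
                int b = in.read(); // blocks: nobody will ever write a response
                return "read:" + b;
            } catch (SocketTimeoutException e) {
                return "timeout"; // connect succeeded, read timed out
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(probe(500));
    }
}
```

The connect succeeds and only the read fails, which is why the client in this sketch sees a timeout rather than a refused connection.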
*Real Cluster*
In a real cluster environment, however, the SCM constructor has not been
called, so the port is not bound. When the RPC client tries to connect to that
port, it gets a {{connection refused}} error. This error is caught and triggers
the RetryPolicy; that is where I saw the 10 retries that cause this problem
(thread leak).
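For comparison, the real cluster case can be sketched the same way with plain sockets: connecting to a port that nobody has bound fails immediately with {{java.net.ConnectException}}, and a bounded retry loop keeps hitting the same error. The class {{RefusedConnectionRetry}} and its methods are hypothetical, and the loop is only a stand-in for the RetryPolicy mentioned above, not the Hadoop implementation.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical demo class, not Ozone code.
public class RefusedConnectionRetry {

    /** Finds a loopback port that is currently free by binding it and releasing it again. */
    static int findUnboundPort() throws IOException {
        try (ServerSocket s = new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            return s.getLocalPort();
        }
    }

    /** Tries to connect up to maxAttempts times and returns how many attempts were refused. */
    static int connectWithRetry(int port, int maxAttempts, long sleepMillis)
            throws IOException, InterruptedException {
        int refused = 0;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try (Socket client = new Socket()) {
                client.connect(
                    new InetSocketAddress(InetAddress.getLoopbackAddress(), port), 1000);
                return refused; // connected, stop retrying
            } catch (ConnectException e) {
                refused++; // port is not bound: fails fast, then retry
                Thread.sleep(sleepMillis);
            }
        }
        return refused;
    }

    public static void main(String[] args) throws Exception {
        int port = findUnboundPort();
        System.out.println("refused attempts: " + connectWithRetry(port, 10, 100));
    }
}
```

Unlike the bound-but-silent case, each attempt here fails during connect, so every failure goes straight into the retry path.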
I am not sure it is worth fixing this in the mini cluster; that would probably
require refactoring the SCM constructor to move the RPC init code out. The
issue can still be reproduced easily in a cluster setup by following the steps
in the description.
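To make the refactoring idea concrete, here is a rough sketch of what "moving the RPC init code out of the constructor" could look like in general. {{LazyBindService}} is a hypothetical stand-in, not {{StorageContainerManager}}; it only shows the pattern of deferring the port bind from the constructor to {{start()}}, using a plain {{ServerSocket}} instead of a Hadoop RPC server.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

// Hypothetical stand-in for a service whose constructor must not bind ports.
public class LazyBindService implements AutoCloseable {
    private final InetSocketAddress address;
    private ServerSocket server; // stays null until start() is called

    LazyBindService(InetSocketAddress address) {
        // Only remember the configuration; no port is bound here.
        this.address = address;
    }

    /** Binds the port; until this is called, clients get "connection refused". */
    synchronized void start() throws IOException {
        if (server == null) {
            server = new ServerSocket();
            server.bind(address);
        }
    }

    synchronized boolean isBound() {
        return server != null && server.isBound();
    }

    synchronized int getPort() {
        return server == null ? -1 : server.getLocalPort();
    }

    @Override
    public synchronized void close() throws IOException {
        if (server != null) {
            server.close();
            server = null;
        }
    }

    public static void main(String[] args) throws IOException {
        InetSocketAddress addr = new InetSocketAddress(InetAddress.getLoopbackAddress(), 0);
        try (LazyBindService svc = new LazyBindService(addr)) {
            System.out.println("bound after constructor: " + svc.isBound());
            svc.start();
            System.out.println("bound after start(): " + svc.isBound());
        }
    }
}
```

With this shape, a test that constructs the service but skips {{start()}} would see the same refused connection as a real cluster where SCM is not running yet.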
Please kindly advise. Thanks.
> Ozone: Datanode is unable to register with scm if scm starts later
> ------------------------------------------------------------------
>
> Key: HDFS-12098
> URL: https://issues.apache.org/jira/browse/HDFS-12098
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: datanode, ozone, scm
> Reporter: Weiwei Yang
> Assignee: Weiwei Yang
> Priority: Critical
> Attachments: disabled-scm-test.patch, HDFS-12098-HDFS-7240.001.patch,
> HDFS-12098-HDFS-7240.002.patch, Screen Shot 2017-07-11 at 4.58.08 PM.png,
> thread_dump.log
>
>
> Reproducing steps
> # Start datanode
> # Wait and see datanode state, it has connection issues, this is expected
> # Start SCM, expecting that the datanode can connect to SCM and the state
> machine can transition to RUNNING. In practice, however, its state
> transitions to SHUTDOWN and the datanode enters chill mode.