[ 
https://issues.apache.org/jira/browse/HDFS-12098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16085463#comment-16085463
 ] 

Weiwei Yang commented on HDFS-12098:
------------------------------------

Ah, I found the difference after hours of debugging. This is not easy to 
reproduce from a mini cluster. Let me explain: the behavior differs between a 
mini cluster and a real cluster setup.

*Mini Cluster*
In class {{MiniOzoneCluster}}, we initialize SCM like this:

{code}
StorageContainerManager scm = new StorageContainerManager(conf);
if (!disableSCM) {
  // start SCM if it is not disabled.
  scm.start();
}
{code}

The SCM constructor initializes the SCM datanode and client RPC servers. During 
this initialization, {{RPC.Builder(conf)...build()}} binds the RPC server to the 
specified port. Once the port is bound, subsequent client RPC calls, e.g.

{code}
SCMVersionResponseProto versionResponse =
    rpcEndPoint.getEndPoint().getVersion(null);
{code}

will try to connect to that port and read data. However, the service is not 
responding, so the call fails with a {{SocketTimeout}}.

*Real Cluster*

However, in a real cluster environment the SCM constructor has not been called 
yet (the SCM process is not running), so the port is not bound. When the RPC 
client tries to connect to that port, it gets a {{connection refused error}}. 
This error is caught and triggers the RetryPolicy; that is where I saw the 10 
retries that cause this problem (thread leak).
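
For illustration, the two failure modes can be seen with plain {{java.net}} sockets, without Hadoop RPC at all. This is only a sketch under assumptions: the class {{PortStateDemo}} and its methods are made-up names, the timeouts are arbitrary, and a {{ServerSocket}} that is bound but never accepts stands in for an RPC server that was built but not started.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

/** Hypothetical demo class; not part of Hadoop or Ozone. */
class PortStateDemo {

  /** Connects to a loopback port, waits for one byte, and reports how it ended. */
  static String probe(int port, int connectTimeoutMs, int readTimeoutMs) {
    try (Socket s = new Socket()) {
      s.connect(new InetSocketAddress(InetAddress.getLoopbackAddress(), port),
          connectTimeoutMs);
      s.setSoTimeout(readTimeoutMs);
      int b = s.getInputStream().read();
      return b < 0 ? "closed" : "data";
    } catch (SocketTimeoutException e) {
      // Connected (or still connecting), but nothing answered in time.
      return "timeout";
    } catch (ConnectException e) {
      // Nothing is listening on the port.
      return "refused";
    } catch (IOException e) {
      return "error: " + e;
    }
  }

  /** Mini cluster analogue: the port is bound, but nobody accepts or replies. */
  static String probeBoundButIdle() {
    try (ServerSocket server =
        new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
      return probe(server.getLocalPort(), 2000, 500);
    } catch (IOException e) {
      return "error: " + e;
    }
  }

  /** Real cluster analogue: the server process is absent, the port is unbound. */
  static String probeUnbound() {
    int port = -1;
    try (ServerSocket server =
        new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
      port = server.getLocalPort(); // borrow a free port number, then release it
    } catch (IOException e) {
      return "error: " + e;
    }
    // Assumes no other process grabs the port between close and connect.
    return probe(port, 2000, 500);
  }

  public static void main(String[] args) {
    System.out.println("bound but idle: " + probeBoundButIdle());
    System.out.println("unbound:        " + probeUnbound());
  }
}
```

On a typical loopback interface the first probe should end in a read timeout and the second in a refused connection, which matches the mini cluster versus real cluster difference described above. Only the second kind of failure is the one that reaches the retry path in this issue.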

I am not sure it is worth fixing this in the mini cluster; that would probably 
require refactoring the SCM constructor to move the RPC init code out. The 
issue can be reproduced easily in a cluster setup by following the steps in the 
description.
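
To make the refactoring idea concrete, here is a rough sketch of the "bind in start(), not in the constructor" shape. It is not the real {{StorageContainerManager}}: {{LazyBindService}} and all of its methods are hypothetical names, and a plain {{ServerSocket}} stands in for the RPC server.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;

/** Hypothetical stand-in for a service whose constructor must not bind ports. */
class LazyBindService implements AutoCloseable {
  private final int requestedPort;
  private ServerSocket socket; // stays null until start() is called

  LazyBindService(int requestedPort) {
    // Constructor only records configuration; no port is bound here.
    this.requestedPort = requestedPort;
  }

  /** Binds the port. Until this runs, clients see "connection refused". */
  synchronized void start() {
    if (socket != null) {
      return; // already started
    }
    try {
      socket = new ServerSocket(requestedPort, 50,
          InetAddress.getLoopbackAddress());
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  synchronized boolean isBound() {
    return socket != null && !socket.isClosed();
  }

  /** Returns the bound port, or -1 if the service is not started. */
  synchronized int getPort() {
    return isBound() ? socket.getLocalPort() : -1;
  }

  @Override
  public synchronized void close() {
    if (socket == null) {
      return;
    }
    try {
      socket.close();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    } finally {
      socket = null;
    }
  }
}
```

With this shape, a mini cluster that constructs the service but skips {{start()}} would leave the port unbound, so a client would fail the same way as against a real cluster where SCM is not running yet.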

Please kindly advise. Thanks.

> Ozone: Datanode is unable to register with scm if scm starts later
> ------------------------------------------------------------------
>
>                 Key: HDFS-12098
>                 URL: https://issues.apache.org/jira/browse/HDFS-12098
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: datanode, ozone, scm
>            Reporter: Weiwei Yang
>            Assignee: Weiwei Yang
>            Priority: Critical
>         Attachments: disabled-scm-test.patch, HDFS-12098-HDFS-7240.001.patch, 
> HDFS-12098-HDFS-7240.002.patch, Screen Shot 2017-07-11 at 4.58.08 PM.png, 
> thread_dump.log
>
>
> Reproducing steps
> # Start datanode
> # Wait and check the datanode state; it has connection issues, which is expected
> # Start SCM, expecting the datanode to connect to SCM and the state 
> machine to transition to RUNNING. However, its state actually transitions to 
> SHUTDOWN and the datanode enters chill mode.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

