[
https://issues.apache.org/jira/browse/HDFS-12098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16078213#comment-16078213
]
Weiwei Yang edited comment on HDFS-12098 at 7/7/17 3:44 PM:
------------------------------------------------------------
This is because the datanode state machine leaks {{VersionEndpointTask}} threads. In
the case SCM is not yet started, more and more {{VersionEndpointTask}} threads
keep retrying the connection to SCM:
{noformat}
INIT - RUNNING
\
GETVERSION
new VersionEndpointTask submitted - retrying ...
... (HB interval)
new VersionEndpointTask submitted - retrying ...
... (HB interval)
new VersionEndpointTask submitted - retrying ...
...
{noformat}
The version endpoint tasks are launched at the HB interval (5s in my env), so a
new task is submitted every 5s; the retry policy for each getVersion call is
10 * 1s = 10s, so only one task can finish every 10s. In every 10s window two
tasks are submitted but only one completes, so ONE thread leaks every 10s.
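To make that arithmetic concrete, here is a hedged, standalone sketch (made-up names, not the real Ozone code). It assumes the pending tasks serialize on a shared lock, which is consistent with the threads sitting in WAITING state in the dump:
{code:java}
// Simplified sketch; LeakSketch and sharedLock are illustrative names only.
// Assumption: pending tasks serialize on a shared lock. A task is submitted every 5s,
// each holds the lock ~10s while retrying, so one extra thread ends up parked every 10s.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class LeakSketch {
  private static final ReentrantLock sharedLock = new ReentrantLock();

  public static void main(String[] args) throws InterruptedException {
    ExecutorService executor = Executors.newCachedThreadPool(); // grows a thread per pending task
    while (true) {
      executor.submit(() -> {
        sharedLock.lock();                 // later tasks park here in WAITING state
        try {
          TimeUnit.SECONDS.sleep(10);      // stands in for 10 x 1s getVersion retries
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        } finally {
          sharedLock.unlock();
        }
      });
      TimeUnit.SECONDS.sleep(5);           // heartbeat interval: submissions outpace completions
    }
  }
}
{code}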
Please see [^thread_dump.log]: there are 20 VersionEndpointTask threads in
WAITING state, and this number keeps increasing.
When SCM comes up, all pending tasks are able to connect to SCM and their
getVersion calls return, so each of them advances the state to the next one.
Since the state is shared in {{EndpointStateMachine}}, it gets incremented more
than once, so when I review the state changes they look like below
{noformat}
REGISTER
HEARTBEAT
SHUTDOWN
SHUTDOWN
SHUTDOWN
...
{noformat}
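A minimal, self-contained sketch of that overshoot (hypothetical {{State}} enum, not the real {{EndpointStateMachine}} API): once several queued tasks complete against the same shared state, the later ones push it past REGISTER/HEARTBEAT straight to SHUTDOWN.
{code:java}
// Hypothetical sketch; the enum and class names are illustrative only.
import java.util.concurrent.atomic.AtomicReference;

public class SharedStateSketch {
  enum State {
    GETVERSION, REGISTER, HEARTBEAT, SHUTDOWN;
    State next() { return this == SHUTDOWN ? SHUTDOWN : values()[ordinal() + 1]; }
  }

  public static void main(String[] args) {
    AtomicReference<State> shared = new AtomicReference<>(State.GETVERSION);
    int pendingTasks = 5;                              // tasks that piled up while SCM was down
    for (int i = 0; i < pendingTasks; i++) {
      // each finished task bumps the SAME shared state, not its own copy
      System.out.println(shared.updateAndGet(State::next));
    }
    // prints: REGISTER, HEARTBEAT, SHUTDOWN, SHUTDOWN, SHUTDOWN
  }
}
{code}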
To fix this, instead of using a central ExecutorService carried in
{{DatanodeStateMachine}}, we could initialize a fixed-size thread pool to
execute the endpoint tasks, and make sure the thread pool is shut down before
entering the next state (at the end of await).
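A rough sketch of that proposal (hypothetical class and method names, not the actual patch): the pool is bounded, drained and shut down before the state machine moves on, so no task can outlive the state it was submitted for.
{code:java}
// Rough sketch of the proposed fix; names are hypothetical, not the actual patch.
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class EndpointTaskRunnerSketch {
  void runEndpointTasks(List<Callable<Void>> endpointTasks) throws InterruptedException {
    // fixed-size pool scoped to this state iteration, instead of a long-lived shared executor
    ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, endpointTasks.size()));
    try {
      pool.invokeAll(endpointTasks);        // submit this round's tasks and wait for them
    } finally {
      pool.shutdown();                      // no task may leak into the next state
      if (!pool.awaitTermination(60, TimeUnit.SECONDS)) {
        pool.shutdownNow();                 // cancel stragglers still retrying against SCM
      }
    }
    // ... only now transition to the next state
  }
}
{code}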
> Ozone: Datanode is unable to register with scm if scm starts later
> ------------------------------------------------------------------
>
> Key: HDFS-12098
> URL: https://issues.apache.org/jira/browse/HDFS-12098
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: datanode, ozone, scm
> Reporter: Weiwei Yang
> Assignee: Weiwei Yang
> Priority: Critical
> Attachments: HDFS-12098-HDFS-7240.001.patch, thread_dump.log
>
>
> Reproducing steps
> # Start datanode
> # Wait and check the datanode state; it has connection issues, which is expected
> # Start SCM, expecting the datanode to connect to SCM and the state machine
> to transition to RUNNING. However, in reality its state transitions to
> SHUTDOWN and the datanode enters chill mode.