[
https://issues.apache.org/jira/browse/HDFS-12098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16082656#comment-16082656
]
Anu Engineer commented on HDFS-12098:
-------------------------------------
[~cheersyang] Thanks for reporting this and posting a patch. Before commenting
on this I would like to simulate this in our unit tests and then test with and
without your patch. I am going to modify MiniOzoneCluster and build it with
flags called *disableSCM* and *disableKSM*, so we can simulate SCM or KSM being
down. I will be able to explore the behavior in greater detail with that.
Some thoughts on this patch: if my understanding is correct, isn't the root
issue that we time out but forget to tell the running thread that we have
already timed out? I was wondering whether we could add an AtomicBoolean to
each task that indicates whether it has timed out. Then, when the thread comes
out, it can see that the caller has timed out and exit. Do you think that
would address this issue?
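To make the AtomicBoolean idea concrete, here is a minimal, self-contained
sketch. It is not taken from the patch or from the Ozone code base: the class
TimedOutFlagSketch, the RegisterTask class, its sleep-based "attempts", and
the result strings are all hypothetical stand-ins for a real endpoint task.
The caller sets the flag when its wait times out, and the worker checks the
flag between attempts and exits instead of running on.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

public class TimedOutFlagSketch {

  /** Hypothetical task that retries until it succeeds or the caller gives up. */
  static final class RegisterTask implements Callable<String> {
    // Set by the caller once it has stopped waiting for this task.
    private final AtomicBoolean callerTimedOut = new AtomicBoolean(false);
    private final long attemptMillis;
    private final int attemptsNeeded;

    RegisterTask(long attemptMillis, int attemptsNeeded) {
      this.attemptMillis = attemptMillis;
      this.attemptsNeeded = attemptsNeeded;
    }

    void markTimedOut() {
      callerTimedOut.set(true);
    }

    @Override
    public String call() throws InterruptedException {
      for (int attempt = 1; attempt <= attemptsNeeded; attempt++) {
        // Check the flag before each attempt so a stale task stops promptly.
        if (callerTimedOut.get()) {
          return "ABANDONED";
        }
        Thread.sleep(attemptMillis); // stands in for one connection attempt
      }
      return callerTimedOut.get() ? "ABANDONED" : "REGISTERED";
    }
  }

  /** Runs the task, and tells it about the timeout if the wait expires. */
  static String runWithTimeout(RegisterTask task, long timeoutMillis)
      throws Exception {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    try {
      Future<String> future = executor.submit(task);
      try {
        return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
      } catch (TimeoutException e) {
        // Communicate the timeout to the running thread.
        task.markTimedOut();
        // Wait for the worker to notice the flag and exit.
        return future.get();
      }
    } finally {
      executor.shutdown();
    }
  }

  public static void main(String[] args) throws Exception {
    // Finishes well inside the timeout.
    System.out.println(runWithTimeout(new RegisterTask(10, 3), 2000));
    // Would need about 5 seconds, so the caller times out first.
    System.out.println(runWithTimeout(new RegisterTask(50, 100), 120));
  }
}
```

In this sketch the worker only notices the flag between attempts, so a single
blocking attempt still runs to completion. Whether that is acceptable for the
real endpoint tasks is an open question for this issue.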
The reason I am asking is that if we pursue the single-thread approach, we
will have to create many state machines for various tasks, for example for
many SCMs or for running some complex SCM commands.
I am fine with that approach too, but this is something I wanted us to
consider.
> Ozone: Datanode is unable to register with scm if scm starts later
> ------------------------------------------------------------------
>
> Key: HDFS-12098
> URL: https://issues.apache.org/jira/browse/HDFS-12098
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: datanode, ozone, scm
> Reporter: Weiwei Yang
> Assignee: Weiwei Yang
> Priority: Critical
> Attachments: HDFS-12098-HDFS-7240.001.patch,
> HDFS-12098-HDFS-7240.002.patch, thread_dump.log
>
>
> Reproducing steps
> # Start datanode
> # Wait and check the datanode state; it has connection issues, which is expected
> # Start SCM. The datanode is expected to connect to the SCM and its state
> machine to transition to RUNNING. Instead, its state transitions to
> SHUTDOWN and the datanode enters chill mode.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)