[
https://issues.apache.org/jira/browse/HDFS-11740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15995416#comment-15995416
]
Anu Engineer edited comment on HDFS-11740 at 5/3/17 6:55 PM:
-------------------------------------------------------------
@Weiwei Yang, thanks for the v2 patch. Sorry for the long comment; I just want
to make sure that we are on the same page.
I am still not able to see the advantage of introducing this change.
So I am going to argue both sides to make sure I understand the costs/benefits
of this solution.
Fixed Heartbeat - Pros:
# Simple to understand and to code for. It lets us write good error messages
like this:
{{Unable to communicate to SCM server at server.hortonworks.com:9861. We have
not been able to communicate to this SCM server for past 3300 seconds.}}
The above error message is from {{EndpointStateMachine.java#logIfneeded}}
# Fewer knobs to adjust -- since init, version and register are three states,
we are optimizing the first 90 seconds of a datanode's life. Since datanodes
are very long-running processes, does this optimization matter?
# When you have a cluster with 3000+ datanodes, SCM might like the fact that
datanodes are slow in reaching out to it.
# Also, in most cases the first 90 seconds will be the time that the datanode
takes to read its data and get ready. So think of a datanode doing two things:
one is reading data off the local HDDs, the other is talking to SCM about its
presence. These workflows can proceed in parallel. In other words, they should
not intermingle unless we reach a place where one has to wait for the other.
The first such point is when SCM sends the datanode a command that it is not
ready to handle yet. By giving the datanode 90 seconds before any such
rendezvous point, we are avoiding a possible wait condition.
Fixed Heartbeat - Cons:
# Datanodes waste the first 90 seconds. In a small cluster we could boot up
much faster.
# When we add new states, this might make the datanode waste even more time.
I wanted to hear your thoughts on this pros/cons argument for why we want to
remove fixed heartbeats and move to variable heartbeats.
More specific things:
# Why the change in executor? So that we can create a pre-planned set of
futures? Please see another of my comments below.
# I like the fact that each state specifies its wait time internally, but
RegisterEndpointTask seems to wait 0 seconds?
# There is one big semantic difference: the current code artificially creates
lags -- for example, the main loop does not run with a fixed cadence.
Instead, it runs a number of seconds from the last time the action was
performed (this is a pattern we use in both SCM and the datanode).
_This is critical, since the RPC layer will/could retry._
If that retry is happening -- let us say one SCM is dead, or there is a network
issue -- we don't want the scheduler to run the next task immediately. We want
some quiet period, since this is an admin task and we should not be consuming
too many resources. I am worried that the RPC retry will run until we time out,
and then, due to this
{code}
ScheduledFuture taskFuture = executor.schedule(
    endpointTask,
    endpointTask.getTaskDuration(),
    TimeUnit.SECONDS);
{code}
an already queued task would fire immediately.
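To illustrate the quiet-period semantics I mean, here is a minimal sketch (class and method names are mine, not from the patch): the next run is scheduled only after the previous run finishes, so a run that blocked for a long time in RPC retries is still followed by the full interval, instead of an already-queued future firing immediately.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Illustrative sketch only -- names are hypothetical, not from the patch.
class QuietPeriodScheduler {
  private final ScheduledExecutorService executor =
      Executors.newSingleThreadScheduledExecutor();

  /**
   * Runs the task, then schedules the next run a full interval *after* the
   * previous run finished. A run that blocks for a long time (e.g. RPC
   * retries until timeout) is therefore still followed by a quiet period.
   */
  void scheduleWithQuietPeriod(Runnable task, long intervalMillis) {
    executor.schedule(() -> {
      try {
        task.run(); // may block in RPC retries for a long time
      } finally {
        // Rescheduling happens only after the task returns, so the quiet
        // period is measured from the end of the previous run.
        scheduleWithQuietPeriod(task, intervalMillis);
      }
    }, intervalMillis, TimeUnit.MILLISECONDS);
  }

  void shutdown() {
    executor.shutdownNow();
  }
}
```

Contrast this with pre-planning all futures up front: there, a slow run eats into (or past) the delay of the next already-scheduled future.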
If you want to support this feature, may I suggest that we make changes in
{{DatanodeStates}}:
* Add a time-to-wait value there.
* In {{start()}}, read the wait value and sleep for that duration -- that
allows you to change each step's time duration.
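A rough sketch of that suggestion, assuming hypothetical wait values (the real enum lives in {{DatanodeStateMachine}}):

```java
import java.util.concurrent.TimeUnit;

// Rough sketch only -- the wait values here are illustrative.
enum DatanodeStates {
  INIT(0),       // no lag needed before initialization
  RUNNING(30),   // heartbeat cadence, cf. OZONE_SCM_HEARTBEAT_INTERVAL_SECONDS
  SHUTDOWN(0);

  private final long waitTimeSeconds;

  DatanodeStates(long waitTimeSeconds) {
    this.waitTimeSeconds = waitTimeSeconds;
  }

  long getWaitTimeSeconds() {
    return waitTimeSeconds;
  }

  /**
   * In start(), each state would sleep for its own wait value before
   * running its task, keeping per-step durations tunable in one place.
   */
  void waitBeforeTask() throws InterruptedException {
    TimeUnit.SECONDS.sleep(waitTimeSeconds);
  }
}
```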
> Ozone: Differentiate time interval for different DatanodeStateMachine state
> tasks
> ---------------------------------------------------------------------------------
>
> Key: HDFS-11740
> URL: https://issues.apache.org/jira/browse/HDFS-11740
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: ozone
> Reporter: Weiwei Yang
> Assignee: Weiwei Yang
> Attachments: HDFS-11740-HDFS-7240.001.patch,
> HDFS-11740-HDFS-7240.002.patch, statemachine_1.png, statemachine_2.png
>
>
> Currently the datanode state machine transitions between tasks at a fixed
> time interval, defined by
> {{ScmConfigKeys#OZONE_SCM_HEARTBEAT_INTERVAL_SECONDS}}, whose default value
> is 30s. Once a datanode is started, it needs 90s before transitioning to the
> {{Heartbeat}} state; such a long lag is not necessary. I propose to improve
> the time-interval handling: it seems only the heartbeat task needs to be
> scheduled at the {{OZONE_SCM_HEARTBEAT_INTERVAL_SECONDS}} interval; the rest
> should be done without any lag.