[ 
https://issues.apache.org/jira/browse/HDFS-11740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15996510#comment-15996510
 ] 

Weiwei Yang commented on HDFS-11740:
------------------------------------

Hi [~anu]

Thanks for your thoughtful comment, I appreciate it. Please see my answers below.

Fixed Heartbeat - Pros:

bq. Simple to understand and write code. We are able to write good error 
messages like this...

This doesn't change. I tested on my cluster, and it still shows the same 
message as before.

bq. Fewer knobs to adjust – Since init, version and register are three states – 
we are optimizing the first 90 seconds of a datanodes life. Since datanodes are 
very long running processes, does this optimization matter?

I think it matters. There will be more states over time, and if every state 
transition sleeps for a fixed interval (currently the interval of the node 
heartbeat to SCM), it might slow down the actual work. For example, in the 
future we may want to support decommissioning a datanode from SCM, with the 
state transitioning to decommissioned once that is done. The decommission may 
take some time while the client waits on it, and the client probably won't be 
happy if it has to wait another 30s for the state to change. Now is a good 
time to make this change because there aren't many states yet, so it is easy 
to do.
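
To make the cost concrete, here is a rough sketch (not code from the patch) 
that compares the wait before reaching heartbeat under a fixed interval with 
the wait under per-state intervals. The class, the state names and the 0s 
value for non-heartbeat states are assumptions for illustration; only the 30s 
heartbeat default and the three pre-heartbeat states come from this issue.

```java
import java.util.EnumMap;
import java.util.Map;

public class StateIntervalSketch {
  // Hypothetical state names for illustration; the real ones may differ.
  enum State { INIT, VERSION, REGISTER, HEARTBEAT }

  // Default heartbeat interval mentioned in this issue.
  static final long HEARTBEAT_INTERVAL_SECONDS = 30;

  // Every state waits the heartbeat interval (the current behaviour).
  static Map<State, Long> fixedIntervals() {
    Map<State, Long> intervals = new EnumMap<>(State.class);
    for (State s : State.values()) {
      intervals.put(s, HEARTBEAT_INTERVAL_SECONDS);
    }
    return intervals;
  }

  // Only HEARTBEAT waits; the other states move on at once (assumed 0s).
  static Map<State, Long> perStateIntervals() {
    Map<State, Long> intervals = new EnumMap<>(State.class);
    for (State s : State.values()) {
      intervals.put(s, s == State.HEARTBEAT ? HEARTBEAT_INTERVAL_SECONDS : 0L);
    }
    return intervals;
  }

  // Sums the wait applied after each state that precedes HEARTBEAT.
  static long secondsToHeartbeat(Map<State, Long> waitAfterState) {
    long total = 0;
    for (State s : State.values()) {
      if (s == State.HEARTBEAT) {
        break;
      }
      total += waitAfterState.get(s);
    }
    return total;
  }

  public static void main(String[] args) {
    System.out.println(secondsToHeartbeat(fixedIntervals()));    // 90
    System.out.println(secondsToHeartbeat(perStateIntervals())); // 0
  }
}
```

Any new state added before or after heartbeat would add another 30s step in 
the fixed-interval case, which is the slowdown I am concerned about.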

bq. If that retry is happening, let us say one SCM is dead or network issue – 
we don't want the scheduler to be running the next task immediately. We want 
some quite period since this is an admin task – and we should not be consuming 
too much resources. I am worried that RPC retry will happen till we time out 
and then due to this

This is true. If a task fails, I can set the interval to something else and 
ask the scheduler to schedule the next task after that time. This can be done 
within the current patch. I will show that in the v3 patch.
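
Roughly, the idea could look like the sketch below. This is only an 
illustration and not the v3 patch: the class, {{runAndGetNextDelay}}, 
{{scheduleNext}} and the 10s quiet period are hypothetical names and values. 
A task that fails returns a longer delay, and the scheduler uses that delay 
for the next task instead of running it immediately.

```java
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

public class BackoffSchedulingSketch {
  // Hypothetical values, for illustration only.
  static final long NORMAL_DELAY_SECONDS = 0;
  static final long FAILURE_QUIET_PERIOD_SECONDS = 10;

  // Runs the task and returns how long to wait before the next task:
  // no wait after success, a quiet period after any failure.
  static long runAndGetNextDelay(Callable<Void> task) {
    try {
      task.call();
      return NORMAL_DELAY_SECONDS;
    } catch (Exception e) {
      return FAILURE_QUIET_PERIOD_SECONDS;
    }
  }

  // Asks the scheduler to run the next task after the given delay.
  static ScheduledFuture<?> scheduleNext(ScheduledExecutorService scheduler,
      Runnable next, long delaySeconds) {
    return scheduler.schedule(next, delaySeconds, TimeUnit.SECONDS);
  }

  public static void main(String[] args) throws Exception {
    Callable<Void> healthy = () -> null;
    Callable<Void> failing = () -> {
      throw new IOException("SCM unreachable");
    };

    long afterSuccess = runAndGetNextDelay(healthy);
    long afterFailure = runAndGetNextDelay(failing);
    System.out.println("delay after success: " + afterSuccess + "s");
    System.out.println("delay after failure: " + afterFailure + "s");

    // Schedule the follow-up of the successful task; it runs right away.
    ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();
    Runnable next = () -> System.out.println("next task");
    scheduleNext(scheduler, next, afterSuccess).get();
    scheduler.shutdown();
  }
}
```

This keeps the fast path for healthy transitions while still giving a quiet 
period when SCM is dead or the network has issues.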

Thanks

> Ozone: Differentiate time interval for different DatanodeStateMachine state 
> tasks
> ---------------------------------------------------------------------------------
>
>                 Key: HDFS-11740
>                 URL: https://issues.apache.org/jira/browse/HDFS-11740
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: ozone
>            Reporter: Weiwei Yang
>            Assignee: Weiwei Yang
>         Attachments: HDFS-11740-HDFS-7240.001.patch, 
> HDFS-11740-HDFS-7240.002.patch, statemachine_1.png, statemachine_2.png
>
>
> Currently the datanode state machine transitions between tasks at a fixed 
> time interval, defined by 
> {{ScmConfigKeys#OZONE_SCM_HEARTBEAT_INTERVAL_SECONDS}}, whose default value 
> is 30s. Once a datanode is started, it needs 90s before it transitions to 
> the {{Heartbeat}} state; such a long lag is not necessary. I propose to 
> improve the time interval handling: it seems only the heartbeat task needs 
> to be scheduled at the {{OZONE_SCM_HEARTBEAT_INTERVAL_SECONDS}} interval, 
> and the rest should run without any lag.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
