[ https://issues.apache.org/jira/browse/HDFS-11740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15995416#comment-15995416 ]
Anu Engineer edited comment on HDFS-11740 at 5/3/17 6:52 PM:
-------------------------------------------------------------

@Weiwei yang, Thanks for the v2 patch. Sorry for the long comment; I just want to make sure that we are on the same page. I am still not able to see the advantage of introducing this change, so I am going to argue both sides to make sure I understand the costs and benefits of this solution.

Fixed Heartbeat - Pros:
# Simple to understand and simple to write code for. We are able to write good error messages like this: {{Unable to communicate to SCM server at server.hortonworks.com:9861. We have not been able to communicate to this SCM server for past 3300 seconds.}} The above error message is from {{EndpointStateMachine.java#logIfneeded}}.
# Fewer knobs to adjust. Since init, version, and register are three states, we are only optimizing the first 90 seconds of a datanode's life. Since datanodes are very long running processes, does this optimization matter?
# When you have a cluster with 3000+ datanodes, SCM might like the fact that datanodes are slow in reaching out to it.
# The first 90 seconds will also, in most cases, be the time the datanode takes to read its data and get ready. So think of a datanode doing two things: one is reading data off the local HDDs, the other is talking to SCM about its presence. These are workflows that can proceed in parallel. In other words, they should not intermingle unless we reach a place where one has to wait for the other. The first such point is when SCM sends a command to the datanode that it is not ready to handle yet. By giving the datanode 90 seconds before any such rendezvous point arises, we are avoiding a possible wait condition.

Fixed Heartbeat - Cons:
# Datanodes waste the first 90 seconds. In a small cluster we could boot up much faster.
# When we add new states, this might make the datanode waste even more time.
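For illustration, the fixed cadence described above can be sketched as follows. This is a simplified model, not the actual {{DatanodeStateMachine}} code; the enum name and state names are assumptions, and only the arithmetic (three pre-heartbeat states times the default 30s interval) comes from the discussion.

```java
import java.util.concurrent.TimeUnit;

// Simplified sketch of the fixed-heartbeat cadence: every endpoint state
// transition waits one full heartbeat interval, so the first heartbeat
// cannot happen before (number of pre-heartbeat states) x interval.
public class FixedHeartbeatSketch {

  // Hypothetical pre-heartbeat states; the real state machine differs.
  enum EndpointState { INIT, GETVERSION, REGISTER }

  // Default of OZONE_SCM_HEARTBEAT_INTERVAL_SECONDS per the issue text.
  static final long HEARTBEAT_INTERVAL_SECONDS = 30;

  // With a fixed cadence, each of the three states above consumes one
  // full interval, regardless of how quickly its task actually finishes.
  static long secondsUntilFirstHeartbeat() {
    return EndpointState.values().length * HEARTBEAT_INTERVAL_SECONDS;
  }

  public static void main(String[] args) {
    System.out.println(secondsUntilFirstHeartbeat()); // prints 90
  }
}
```

This is the lag the cons list refers to: 3 x 30s = 90s before the datanode reaches the heartbeat state, even when each step completes in milliseconds.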
I wanted to see your thoughts on the pros/cons argument on why we want to remove fixed heartbeats and move to variable heartbeats. More specific things:
# Why the change in executor? So that we can create a pre-planned set of futures? Please see another of my comments below.
# I like the fact that each state specifies the wait time internally, but RegisterEndpointTask seems to wait 0 seconds?
# There is one big semantic difference: the current code artificially creates lags. For example, the main loop does not run with a fixed cadence; instead, it runs a number of seconds from the last time the action was performed (this is a pattern we use in both SCM and the datanode). _This is critical, since the RPC layer will/could retry._ If that retry is happening, let us say because one SCM is dead or there is a network issue, we don't want the scheduler to run the next task immediately. We want some quiet period, since this is an admin task and we should not be consuming too many resources. I am worried that the RPC retry will run until we time out, and then, due to this {{ ScheduledFuture taskFuture = executor.schedule( endpointTask, endpointTask.getTaskDuration(), TimeUnit.SECONDS); }}, an already queued task would fire immediately.

If you want to support this feature, may I suggest that we make changes in {{DatanodeStates}}:
* Add a time-to-wait value here.
* In {{start()}}, read the wait value and sleep for that duration -- that allows you to change each step's time duration.

> Ozone: Differentiate time interval for different DatanodeStateMachine state
> tasks
> ---------------------------------------------------------------------------------
>
>                 Key: HDFS-11740
>                 URL: https://issues.apache.org/jira/browse/HDFS-11740
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: ozone
>            Reporter: Weiwei Yang
>            Assignee: Weiwei Yang
>         Attachments: HDFS-11740-HDFS-7240.001.patch,
>                      HDFS-11740-HDFS-7240.002.patch, statemachine_1.png, statemachine_2.png
>
> Currently the datanode state machine transitions between tasks at a fixed
> time interval, defined by {{ScmConfigKeys#OZONE_SCM_HEARTBEAT_INTERVAL_SECONDS}},
> whose default value is 30s. Once a datanode is started, it needs 90s before
> it transitions to the {{Heartbeat}} state; such a long lag is not necessary.
> I propose to improve the time interval handling: it seems only the heartbeat
> task needs to be scheduled at the {{OZONE_SCM_HEARTBEAT_INTERVAL_SECONDS}}
> interval, and the rest should be done without any lag.
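The {{DatanodeStates}} change suggested in the comment above could be sketched roughly as follows. This is only an illustration of the idea, not the real Ozone code: the enum name, state names, and wait values are assumptions. Because the sleep in {{runState}} happens after the previous attempt has fully finished (including any RPC retries), a quiet period is always observed, and no pre-queued future can fire immediately.

```java
import java.util.concurrent.TimeUnit;

// Sketch of the suggestion: give each state its own wait value, and have the
// state runner sleep for that duration *before* running the task. Pre-heartbeat
// states can then use a 0s wait (no startup lag) while heartbeat keeps its
// 30s cadence. All names and values here are illustrative.
public class PerStateWait {

  enum EndpointState {
    GETVERSION(0), REGISTER(0), HEARTBEAT(30);

    final long waitSeconds;
    EndpointState(long waitSeconds) { this.waitSeconds = waitSeconds; }
    long getWaitSeconds() { return waitSeconds; }
  }

  // Run one state: quiet period first, then the state's task. Sleeping here,
  // rather than pre-scheduling a future, means the wait is measured from the
  // end of the previous attempt, not from when the task was queued.
  static void runState(EndpointState state, Runnable task)
      throws InterruptedException {
    TimeUnit.SECONDS.sleep(state.getWaitSeconds());
    task.run();
  }

  public static void main(String[] args) throws InterruptedException {
    for (EndpointState s : EndpointState.values()) {
      System.out.println(s + " waits " + s.getWaitSeconds() + "s");
    }
    // GETVERSION has a 0s wait, so this runs with no lag.
    runState(EndpointState.GETVERSION,
        () -> System.out.println("version task ran"));
  }
}
```

Under this shape, the variable-heartbeat goal from the issue description (only the heartbeat task keeps the 30s interval) and the quiet-period concern from the comment are both satisfied by a single per-state wait value.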
-- This message was sent by Atlassian JIRA (v6.3.15#6346)