[
https://issues.apache.org/jira/browse/HDFS-13442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16437711#comment-16437711
]
Anu Engineer commented on HDFS-13442:
-------------------------------------
Hi [~hanishakoneru],
Thanks for the patch. However, I feel that a data node should give up on
registration only after a really long time or when it hits an error condition.
Retrying 10 times seems too low. For example, if the data nodes boot up earlier
than SCM, we would not want them to go silent after 10 tries (somewhere around
5 minutes). If we are going to set a default value for max retries, we should
target something on the order of a day, say 24 hours or so.
In fact, we can read the HB (heartbeat) frequency config value and derive from
it a retry count that adds up to 12 or 24 hours.
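
As a rough illustration of that calculation, the sketch below derives a retry
limit from a heartbeat interval and a target retry window. The class and method
names are hypothetical and not taken from the patch, and the 30-second interval
is only an assumption inferred from "10 tries is around 5 minutes"; the real
value would come from the HB frequency config.

```java
import java.time.Duration;

/** Hypothetical helper; names are illustrative and not from the patch. */
public final class RegistrationRetryLimit {

  private RegistrationRetryLimit() { }

  /**
   * Returns how many attempts fit in the retry window when one attempt is
   * made per heartbeat interval, rounding up so the whole window is covered.
   */
  public static long maxRetries(Duration heartbeatInterval,
      Duration retryWindow) {
    long intervalMs = heartbeatInterval.toMillis();
    long windowMs = retryWindow.toMillis();
    if (intervalMs <= 0) {
      throw new IllegalArgumentException(
          "heartbeat interval must be at least 1 ms");
    }
    if (windowMs < 0) {
      throw new IllegalArgumentException("retry window must not be negative");
    }
    // Ceiling division: (a + b - 1) / b for non-negative a and positive b.
    return (windowMs + intervalMs - 1) / intervalMs;
  }

  public static void main(String[] args) {
    Duration heartbeat = Duration.ofSeconds(30); // assumed interval
    System.out.println(maxRetries(heartbeat, Duration.ofHours(24)));  // 2880
    System.out.println(maxRetries(heartbeat, Duration.ofMinutes(5))); // 10
  }
}
```

Expressing the limit as a time window rather than a raw count keeps the default
meaningful if the heartbeat interval is tuned later.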
Also, in the case where we get the error _errorNodeNotPermitted_, should we
shut down the data node and create some kind of error record on SCM so we can
get that info back from SCM? I am also OK with the current approach, where we
let the system slowly time out.
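
To make the two options concrete, here is a minimal sketch of the decision a
data node could take after each registration attempt, assuming the shutdown
option is chosen. Only errorNodeNotPermitted comes from this discussion; the
class, the enums, and the other result values are hypothetical and do not
reflect the patch or the actual SCM protocol.

```java
/** Hypothetical decision helper; names are illustrative only. */
public final class RegistrationPolicy {

  /** Assumed outcomes of a single registration attempt. */
  public enum Result { SUCCESS, ERROR_NODE_NOT_PERMITTED, TRANSIENT_FAILURE }

  /** What the data node does next. */
  public enum Action { RUN, RETRY, GIVE_UP, SHUTDOWN }

  private RegistrationPolicy() { }

  /**
   * Decides the next step after an attempt.
   *
   * @param result        outcome of the attempt that just finished
   * @param attemptsSoFar number of attempts made, including this one
   * @param maxRetries    configured upper bound on attempts
   */
  public static Action nextAction(Result result, long attemptsSoFar,
      long maxRetries) {
    switch (result) {
      case SUCCESS:
        return Action.RUN;
      case ERROR_NODE_NOT_PERMITTED:
        // Retrying cannot help if SCM rejects the node outright.
        return Action.SHUTDOWN;
      default:
        // Any other failure is treated as transient until the limit is hit.
        return attemptsSoFar >= maxRetries ? Action.GIVE_UP : Action.RETRY;
    }
  }
}
```

Under the current approach mentioned above, the ERROR_NODE_NOT_PERMITTED branch
would instead fall through to the retry path and let the attempts run out.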
> Ozone: Handle Datanode Registration failure
> -------------------------------------------
>
> Key: HDFS-13442
> URL: https://issues.apache.org/jira/browse/HDFS-13442
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: ozone
> Affects Versions: HDFS-7240
> Reporter: Hanisha Koneru
> Assignee: Hanisha Koneru
> Priority: Major
> Attachments: HDFS-13442-HDFS-7240.001.patch
>
>
> If a datanode is not able to register itself, we need to handle that
> correctly.
> If the number of unsuccessful attempts to register with the SCM exceeds a
> configurable max number, the datanode should not make any more attempts.