[ 
https://issues.apache.org/jira/browse/HDFS-9087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14902879#comment-14902879
 ] 

Colin Patrick McCabe commented on HDFS-9087:
--------------------------------------------

Thanks for this, [~eclark].

So, the current patch adds a new configuration key, 
{{hadoop.hdfs.datanode.checkdisk.interval}}.  Do we really need to add this 
configuration key?  If so, it needs to go in {{hdfs-default.xml}}, needs to 
have a constant defining its default value, needs to be documented, and so on.
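
As a rough sketch of the key-plus-default-constant pattern: the class name, 
the constant names, and the 5000 ms default below are all hypothetical, and 
{{java.util.Properties}} only stands in for the real configuration object so 
that the snippet is self-contained.

```java
import java.util.Properties;

public class CheckDiskConfigKeys {
  // Key name as it appears in the patch under review.
  static final String CHECKDISK_INTERVAL_KEY =
      "hadoop.hdfs.datanode.checkdisk.interval";

  // Assumed default in milliseconds, for illustration only.
  static final int CHECKDISK_INTERVAL_DEFAULT = 5000;

  // Returns the configured interval, or the default when the key is unset.
  static int getCheckDiskInterval(Properties conf) {
    String raw = conf.getProperty(CHECKDISK_INTERVAL_KEY);
    return raw == null ? CHECKDISK_INTERVAL_DEFAULT
                       : Integer.parseInt(raw.trim());
  }
}
```

With both constants in one place, the default only has to be kept in sync 
with the documented value in a single spot.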

Should the jitter be a fixed percentage of the period rather than the entire 
period? Currently we have this:
{code}
this.checkDiskErrorInterval =
    ThreadLocalRandom.current().nextInt(checkDiskPeriod) + checkDiskPeriod;
{code}
This can make the interval up to twice the period.  It might be better to 
limit the jitter to 25% or 50% of the period.
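
A minimal sketch of what bounded jitter could look like. The class name, 
method name, and {{jitterFraction}} parameter are made up for illustration 
and are not part of the patch.

```java
import java.util.concurrent.ThreadLocalRandom;

public class JitteredInterval {
  /**
   * Returns periodMs plus a random jitter in [0, periodMs * jitterFraction].
   * jitterFraction is a hypothetical knob, e.g. 0.25 or 0.5.
   */
  static long withJitter(long periodMs, double jitterFraction) {
    if (periodMs <= 0) {
      throw new IllegalArgumentException("periodMs must be positive");
    }
    if (jitterFraction < 0.0 || jitterFraction > 1.0) {
      throw new IllegalArgumentException("jitterFraction must be in [0, 1]");
    }
    long maxJitter = (long) (periodMs * jitterFraction);
    // nextLong(bound) requires bound > 0; the +1 makes maxJitter inclusive.
    long jitter = ThreadLocalRandom.current().nextLong(maxJitter + 1);
    return periodMs + jitter;
  }
}
```

With a period of 5000 ms and a fraction of 0.25, the result always falls 
between 5000 and 6250 ms, rather than anywhere up to twice the period.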

Also, it looks like in the current patch, one code path initializes 
checkDiskErrorInterval to 5000 every time, whereas the other implements the 
aforementioned jitter.  Both code paths should set checkDiskErrorInterval the 
same way.
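
One way to keep the two code paths consistent is to route both through a 
single helper. This is only a sketch: {{DiskCheckTimer}}, 
{{onCheckCompleted}}, and {{nextInterval}} are hypothetical names, and the 
50% jitter cap is an assumption rather than what the patch does.

```java
import java.util.concurrent.ThreadLocalRandom;

public class DiskCheckTimer {
  private final int checkDiskPeriodMs;
  private long checkDiskErrorInterval;

  DiskCheckTimer(int checkDiskPeriodMs) {
    if (checkDiskPeriodMs <= 0) {
      throw new IllegalArgumentException("period must be positive");
    }
    this.checkDiskPeriodMs = checkDiskPeriodMs;
    // Code path 1: initial setup uses the shared helper.
    this.checkDiskErrorInterval = nextInterval();
  }

  // Code path 2: re-arming after a check uses the same helper.
  void onCheckCompleted() {
    this.checkDiskErrorInterval = nextInterval();
  }

  // Single place that defines the interval: period plus up to 50% jitter.
  private long nextInterval() {
    int maxJitter = checkDiskPeriodMs / 2;
    return (long) checkDiskPeriodMs
        + ThreadLocalRandom.current().nextInt(maxJitter + 1);
  }

  long getCheckDiskErrorInterval() {
    return checkDiskErrorInterval;
  }
}
```

Neither path can then fall back to a hard-coded 5000 without the other 
changing too.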

> Add some jitter to DataNode.checkDiskErrorThread
> ------------------------------------------------
>
>                 Key: HDFS-9087
>                 URL: https://issues.apache.org/jira/browse/HDFS-9087
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.6.0
>            Reporter: Elliott Clark
>            Assignee: Elliott Clark
>         Attachments: HDFS-9087-v0.patch, HDFS-9087-v1.patch
>
>
> If all datanodes are started across a cluster at the same time (or errors in 
> the network cause ioexceptions) there can be storms where lots of datanodes 
> check their disks at the exact same time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
