[
https://issues.apache.org/jira/browse/HDFS-9087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14902879#comment-14902879
]
Colin Patrick McCabe commented on HDFS-9087:
--------------------------------------------
Thanks for this, [~eclark].
So, the current patch adds a new configuration key,
{{hadoop.hdfs.datanode.checkdisk.interval}}. Do we really need to add this
configuration key? If so, it needs to go in {{hdfs-default.xml}}, needs to
have a constant defining its default value, needs to be documented, and so on.
Should the jitter be a fixed percentage of the period rather than the entire
period? Currently we have this:
{code}
this.checkDiskErrorInterval =
    ThreadLocalRandom.current().nextInt(checkDiskPeriod) + checkDiskPeriod;
{code}
This could as much as double the period. It might be better to limit the
jitter to 25% or 50% of the period.
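As a rough sketch of what bounded jitter could look like (the class name {{JitteredInterval}}, the method {{withJitter}}, and the {{jitterFraction}} parameter are made up for illustration and are not part of the patch):

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper, not part of the patch: adds jitter that is capped at a
// fraction of the base period instead of the whole period.
final class JitteredInterval {
    private JitteredInterval() {}

    /**
     * Returns basePeriodMs plus a random amount in
     * [0, basePeriodMs * jitterFraction], inclusive.
     */
    static long withJitter(long basePeriodMs, double jitterFraction) {
        if (basePeriodMs <= 0) {
            throw new IllegalArgumentException("basePeriodMs must be positive");
        }
        if (jitterFraction < 0.0 || jitterFraction > 1.0) {
            throw new IllegalArgumentException("jitterFraction must be in [0, 1]");
        }
        long maxJitter = (long) (basePeriodMs * jitterFraction);
        // nextLong(bound) is exclusive of bound, so add 1 to include maxJitter.
        long jitter = ThreadLocalRandom.current().nextLong(maxJitter + 1);
        return basePeriodMs + jitter;
    }
}
```

With a base period of 5000 ms and a fraction of 0.25, the result stays between 5000 and 6250 ms rather than ranging up to 10000 ms.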
Also, it looks like in the current patch, one code path initializes
{{checkDiskErrorInterval}} to 5000 every time, whereas the other applies the
aforementioned jitter. Both code paths should set {{checkDiskErrorInterval}}
the same way.
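One way to keep the two paths consistent is to route both through a single private method. The sketch below is only illustrative: {{DiskCheckScheduler}}, {{onCheckCompleted}}, and {{computeInterval}} are hypothetical names, the two paths shown (construction and re-arming after a check) are stand-ins for whatever the two paths in the patch actually are, and the 50% cap is an assumption.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical stand-in for the relevant DataNode state, not the real class.
final class DiskCheckScheduler {
    private final long checkDiskPeriodMs;
    private long checkDiskErrorInterval;

    DiskCheckScheduler(long checkDiskPeriodMs) {
        if (checkDiskPeriodMs <= 0) {
            throw new IllegalArgumentException("checkDiskPeriodMs must be positive");
        }
        this.checkDiskPeriodMs = checkDiskPeriodMs;
        // Path 1: initial setup uses the shared computation.
        this.checkDiskErrorInterval = computeInterval();
    }

    // Path 2: re-arming after a check uses the same computation,
    // rather than resetting to a hard-coded value.
    void onCheckCompleted() {
        this.checkDiskErrorInterval = computeInterval();
    }

    long getCheckDiskErrorInterval() {
        return checkDiskErrorInterval;
    }

    // Single place where the interval is derived: base period plus a random
    // amount of at most 50% of the period (the cap is an assumed value).
    private long computeInterval() {
        long maxJitter = checkDiskPeriodMs / 2;
        return checkDiskPeriodMs
            + ThreadLocalRandom.current().nextLong(maxJitter + 1);
    }
}
```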
> Add some jitter to DataNode.checkDiskErrorThread
> ------------------------------------------------
>
> Key: HDFS-9087
> URL: https://issues.apache.org/jira/browse/HDFS-9087
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode
> Affects Versions: 2.6.0
> Reporter: Elliott Clark
> Assignee: Elliott Clark
> Attachments: HDFS-9087-v0.patch, HDFS-9087-v1.patch
>
>
> If all datanodes across a cluster are started at the same time (or network
> errors cause IOExceptions), there can be storms where many datanodes check
> their disks at exactly the same time.