[
https://issues.apache.org/jira/browse/HDFS-3751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13427421#comment-13427421
]
Todd Lipcon commented on HDFS-3751:
-----------------------------------
Hey Bobby. We recently added metrics for these timings (HDFS-3170) and now
calculate quantiles for them as well (HDFS-3650). I agree it would be nice to
track them dynamically per mount, but I think that's a bit more complicated
than the simple warning proposed here.
We used a hacked-up version of this proposed patch on a customer workload, and
even the really simple logging was super helpful. Most people already have a
way of grepping logs for certain key warning messages to trigger alerts, so
even without Hadoop-side support for aggregating and counting the metrics, I
think this should go in. Then let's file a separate JIRA to collect per-disk
metrics using the metrics2 dynamic metrics support.
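To illustrate the log-grepping approach, here is a rough sketch of a scanner that counts slow-IO warnings per volume. It assumes the message format from the example in the issue description; the class name, the regex, and the sample log lines are all hypothetical, not anything from the actual patch.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SlowIoLogScan {
    // Assumed message shape, copied from the example in the issue
    // description; the real patch may word the warning differently.
    static final Pattern SLOW_IO = Pattern.compile(
        "WARN.*IO of \\d+kb to volume (\\S+) for block \\d+ "
        + "client \\S+ took ([0-9.]+) seconds");

    /** Counts matching WARN lines per volume path. */
    static Map<String, Integer> countByVolume(List<String> logLines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : logLines) {
            Matcher m = SLOW_IO.matcher(line);
            if (m.find()) {
                counts.merge(m.group(1), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample lines, for illustration only.
        List<String> lines = Arrays.asList(
            "WARN IO of 64kb to volume /data/1/dfs/dn for block 12345234 client 1.2.3.4 took 61.3 seconds",
            "INFO Received block 12345235",
            "WARN IO of 64kb to volume /data/2/dfs/dn for block 12345236 client 1.2.3.5 took 12.0 seconds",
            "WARN IO of 128kb to volume /data/1/dfs/dn for block 12345237 client 1.2.3.4 took 30.5 seconds");
        System.out.println(countByVolume(lines));
    }
}
```

An alerting script could fire whenever one volume's count crosses some site-specific limit, which gets a crude per-disk signal without any Hadoop-side aggregation.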
> DN should log warnings for lengthy disk IOs
> -------------------------------------------
>
> Key: HDFS-3751
> URL: https://issues.apache.org/jira/browse/HDFS-3751
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: data-node
> Affects Versions: 1.2.0, 2.1.0-alpha
> Reporter: Todd Lipcon
> Assignee: Colin Patrick McCabe
>
> Occasionally failing disks or other OS-and-below issues cause a single IO to
> take tens of seconds, or even minutes in the case of failures. This often
> results in timeout exceptions at the client side which are hard to diagnose.
> It would be easier to root-cause these issues if the DN logged a WARN like
> "IO of 64kb to volume /data/1/dfs/dn for block 12345234 client 1.2.3.4 took
> 61.3 seconds" or somesuch.
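The warning described above could be sketched roughly as follows. This is a minimal illustration, not the actual patch: the class name, the checkIo helper, and the 10-second threshold are all assumptions (the issue does not specify a threshold), and a real DataNode would go through its logging framework rather than stderr.

```java
import java.util.Locale;
import java.util.concurrent.TimeUnit;

class SlowIoWarning {
    // Hypothetical threshold; the issue does not name a value.
    static final long WARN_THRESHOLD_MS = 10_000;

    /**
     * Returns a warning message if the IO took at least the threshold,
     * or null if it completed quickly enough.
     */
    static String checkIo(long bytes, String volume, long blockId,
                          String client, long elapsedNanos) {
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(elapsedNanos);
        if (elapsedMs < WARN_THRESHOLD_MS) {
            return null;
        }
        return String.format(Locale.ROOT,
            "IO of %dkb to volume %s for block %d client %s took %.1f seconds",
            bytes / 1024, volume, blockId, client, elapsedMs / 1000.0);
    }

    public static void main(String[] args) throws InterruptedException {
        // Time a stand-in for a real disk write with a monotonic clock.
        long start = System.nanoTime();
        Thread.sleep(5);
        long elapsed = System.nanoTime() - start;

        String warning = checkIo(65536, "/data/1/dfs/dn", 12345234L,
                                 "1.2.3.4", elapsed);
        if (warning != null) {
            System.err.println("WARN " + warning);
        }
    }
}
```

Using System.nanoTime() rather than wall-clock time keeps the measurement immune to clock adjustments, which matters when the stalls being chased are tens of seconds long.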