[jira] [Commented] (HDFS-3751) DN should log warnings for lengthy disk IOs
[ https://issues.apache.org/jira/browse/HDFS-3751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13656728#comment-13656728 ]

Matt Foley commented on HDFS-3751:
--
Changed Target Version to 1.3.0 upon release of 1.2.0. Please change to 1.2.1 if you intend to submit a fix for branch-1.2.

DN should log warnings for lengthy disk IOs
---
Key: HDFS-3751
URL: https://issues.apache.org/jira/browse/HDFS-3751
Project: Hadoop HDFS
Issue Type: Improvement
Components: datanode
Affects Versions: 1.2.0, 2.0.0-alpha
Reporter: Todd Lipcon
Assignee: Colin Patrick McCabe

Occasionally failing disks or other OS-and-below issues cause a single IO to take tens of seconds, or even minutes in the case of failures. This often results in timeout exceptions at the client side which are hard to diagnose. It would be easier to root-cause these issues if the DN logged a WARN like

IO of 64kb to volume /data/1/dfs/dn for block 12345234 client 1.2.3.4 took 61.3 seconds

or somesuch.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-3751) DN should log warnings for lengthy disk IOs
[ https://issues.apache.org/jira/browse/HDFS-3751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13427320#comment-13427320 ]

Robert Joseph Evans commented on HDFS-3751:
---
If we are collecting this data to be able to output a warning, it would be good to also keep metrics for each disk. This would potentially give us the ability in the future to have an admin look at the disk metrics and look for outliers. They could then investigate further and possibly remove the failing disk.

DN should log warnings for lengthy disk IOs
---
Key: HDFS-3751
URL: https://issues.apache.org/jira/browse/HDFS-3751
Project: Hadoop HDFS
Issue Type: Improvement
Components: data-node
Affects Versions: 1.2.0, 2.1.0-alpha
Reporter: Todd Lipcon
Assignee: Colin Patrick McCabe
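
As an illustration of the per-disk outlier idea in the comment above, here is a minimal sketch in plain Java. It does not use the Hadoop metrics2 API; the class name VolumeLatencyStats, the method names, and the "mean above a multiple of the median" rule are all assumptions made for this example, not anything proposed in the issue.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical per-volume latency tracker; not part of Hadoop. */
final class VolumeLatencyStats {
    // volume path -> {sum of latencies in millis, number of samples}
    private final Map<String, long[]> totals = new TreeMap<>();

    /** Records one IO latency sample for the given volume. */
    void record(String volume, long millis) {
        long[] t = totals.computeIfAbsent(volume, v -> new long[2]);
        t[0] += millis;
        t[1]++;
    }

    /** Mean latency in millis for a volume, or 0.0 if it has no samples. */
    double mean(String volume) {
        long[] t = totals.get(volume);
        return (t == null || t[1] == 0) ? 0.0 : (double) t[0] / t[1];
    }

    /** Volumes whose mean latency exceeds factor times the median of all volume means. */
    List<String> outliers(double factor) {
        List<Double> means = new ArrayList<>();
        for (String v : totals.keySet()) {
            means.add(mean(v));
        }
        if (means.isEmpty()) {
            return Collections.emptyList();
        }
        Collections.sort(means);
        double median = means.get(means.size() / 2); // upper median for even counts
        List<String> result = new ArrayList<>();
        for (String v : totals.keySet()) {
            if (mean(v) > factor * median) {
                result.add(v);
            }
        }
        return result;
    }
}
```

An admin-facing tool could call outliers(3.0) periodically and report the returned volumes for further investigation; the factor is arbitrary and would need tuning against real workloads.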
[jira] [Commented] (HDFS-3751) DN should log warnings for lengthy disk IOs
[ https://issues.apache.org/jira/browse/HDFS-3751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13427421#comment-13427421 ]

Todd Lipcon commented on HDFS-3751:
---
Hey Bobby. We recently added metrics for these timings (HDFS-3170) and now calculate quantiles for them as well (HDFS-3650). I agree it would be nice to track them dynamically per mount, but I think that's a bit more complicated than the simple warning proposed here.

We used a hacked up version of this proposed patch on a customer workload, and even the really simple logging was super helpful. Most people already have a way of grepping logs for certain key warning messages to trigger alerts, so even without Hadoop-side support for aggregating and counting the metrics, I think this should go in. Then let's file a separate JIRA to collect per-disk metrics using the metrics2 dynamic metrics support.
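
To make the log-grepping approach concrete, the following sketch matches the example warning text from the issue description and extracts the volume and duration. The pattern is written against that example only; the format actually emitted by a DataNode may differ, and the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical matcher for the example slow-IO warning; the real log format may differ. */
final class SlowIoLogMatcher {
    private static final Pattern SLOW_IO = Pattern.compile(
        "IO of (\\d+)kb to volume (\\S+) for block (\\d+) client (\\S+)"
        + " took (\\d+(?:\\.\\d+)?) seconds");

    /** Returns the reported duration in seconds, or -1.0 if the line is not a slow-IO warning. */
    static double slowIoSeconds(String logLine) {
        Matcher m = SLOW_IO.matcher(logLine);
        return m.find() ? Double.parseDouble(m.group(5)) : -1.0;
    }

    /** Returns the volume named in the warning, or null if the line does not match. */
    static String volume(String logLine) {
        Matcher m = SLOW_IO.matcher(logLine);
        return m.find() ? m.group(2) : null;
    }
}
```

An alerting script could feed each DataNode log line through slowIoSeconds and raise an alert when the result exceeds a chosen limit, grouping by the value returned from volume.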
[jira] [Commented] (HDFS-3751) DN should log warnings for lengthy disk IOs
[ https://issues.apache.org/jira/browse/HDFS-3751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13427437#comment-13427437 ]

Andrew Purtell commented on HDFS-3751:
--
+1 Sounds like a simple change that would be really helpful.

bq. Most people already have a way of grepping logs for certain key warning messages to trigger alerts,

Often Splunk but with HADOOP-7705 could also use ElasticSearch, etc.
[jira] [Commented] (HDFS-3751) DN should log warnings for lengthy disk IOs
[ https://issues.apache.org/jira/browse/HDFS-3751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13426937#comment-13426937 ]

Colin Patrick McCabe commented on HDFS-3751:
Seems like a very good idea. Another good use for the monotonic timer.

DN should log warnings for lengthy disk IOs
---
Key: HDFS-3751
URL: https://issues.apache.org/jira/browse/HDFS-3751
Project: Hadoop HDFS
Issue Type: Improvement
Components: data-node
Affects Versions: 1.2.0, 2.1.0-alpha
Reporter: Todd Lipcon
Assignee: Todd Lipcon
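
The warning proposed in the issue description, timed with a clock that is not tied to wall-clock time, might look roughly like the sketch below. This is not the HDFS-3751 patch: the class SlowIoWarner, its methods, and the 10-second threshold are assumptions for illustration, and the sketch uses the JDK's System.nanoTime() rather than any Hadoop utility.

```java
import java.util.Locale;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper showing the idea; not the actual DataNode code. */
final class SlowIoWarner {
    // Assumed threshold; the issue does not specify a value.
    static final long THRESHOLD_NANOS = TimeUnit.SECONDS.toNanos(10);

    /** Builds the warning text in the shape suggested in the issue description. */
    static String formatWarning(long bytes, String volume, long blockId,
                                String client, long elapsedNanos) {
        double seconds = elapsedNanos / 1e9;
        return String.format(Locale.ROOT,
            "IO of %dkb to volume %s for block %d client %s took %.1f seconds",
            bytes / 1024, volume, blockId, client, seconds);
    }

    /** True if the elapsed time exceeds the assumed threshold. */
    static boolean isSlow(long elapsedNanos) {
        return elapsedNanos > THRESHOLD_NANOS;
    }

    /** Runs the IO and returns its duration in nanoseconds. */
    static long timeIo(Runnable io) {
        long start = System.nanoTime(); // elapsed-time source, unaffected by wall-clock changes
        io.run();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        long elapsed = timeIo(() -> { /* the disk write would go here */ });
        if (isSlow(elapsed)) {
            System.err.println("WARN " + formatWarning(
                64 * 1024, "/data/1/dfs/dn", 12345234L, "1.2.3.4", elapsed));
        }
    }
}
```

Measuring with System.nanoTime() instead of System.currentTimeMillis() avoids false warnings, or missed ones, when the system clock is adjusted during an IO.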