[jira] [Commented] (HDFS-3751) DN should log warnings for lengthy disk IOs

2013-05-13 Thread Matt Foley (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13656728#comment-13656728
 ] 

Matt Foley commented on HDFS-3751:
--

Changed Target Version to 1.3.0 upon release of 1.2.0. Please change to 1.2.1 
if you intend to submit a fix for branch-1.2.

 DN should log warnings for lengthy disk IOs
 ---

 Key: HDFS-3751
 URL: https://issues.apache.org/jira/browse/HDFS-3751
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: datanode
Affects Versions: 1.2.0, 2.0.0-alpha
Reporter: Todd Lipcon
Assignee: Colin Patrick McCabe

 Occasionally failing disks or other OS-and-below issues cause a single IO to 
 take tens of seconds, or even minutes in the case of failures. This often 
 results in timeout exceptions at the client side which are hard to diagnose. 
 It would be easier to root-cause these issues if the DN logged a WARN like 
 "IO of 64kb to volume /data/1/dfs/dn for block 12345234 client 1.2.3.4 took 
 61.3 seconds" or somesuch.
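
A minimal sketch of the proposed warning, in Java since the DataNode is written in Java. The class name `SlowIoWarning`, its method names, and the 10-second threshold are assumptions made for illustration; the issue only gives the example message text, and this is not the actual patch.

```java
import java.util.Locale;

/** Hypothetical helper for illustration; not an existing Hadoop class. */
final class SlowIoWarning {
    /** Assumed threshold; the issue does not specify one. */
    static final long DEFAULT_THRESHOLD_MS = 10_000;

    private SlowIoWarning() {}

    /** True when a single IO took at least thresholdMs milliseconds. */
    static boolean isSlow(long elapsedMs, long thresholdMs) {
        return elapsedMs >= thresholdMs;
    }

    /** Builds a message shaped like the example in the issue description. */
    static String format(long bytes, String volume, long blockId,
                         String client, long elapsedMs) {
        return String.format(Locale.ROOT,
            "IO of %dkb to volume %s for block %d client %s took %.1f seconds",
            bytes / 1024, volume, blockId, client, elapsedMs / 1000.0);
    }

    public static void main(String[] args) {
        long elapsedMs = 61_300; // pretend a 64 KB write took 61.3 seconds
        if (isSlow(elapsedMs, DEFAULT_THRESHOLD_MS)) {
            // A real DataNode would go through its logger at WARN level.
            System.err.println("WARN " + format(64 * 1024, "/data/1/dfs/dn",
                12345234L, "1.2.3.4", elapsedMs));
        }
    }
}
```

Keeping the check and the formatting separate means the threshold could later come from configuration without touching the message text.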

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-3751) DN should log warnings for lengthy disk IOs

2012-08-02 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13427320#comment-13427320
 ] 

Robert Joseph Evans commented on HDFS-3751:
---

If we are collecting this data in order to output a warning, it would be good 
to also keep metrics for each disk.  That would let an admin look at the 
per-disk metrics in the future and spot outliers.  They could then investigate 
further and possibly remove the failing disk.
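
The per-disk idea could look roughly like the sketch below. `PerDiskIoStats` and its methods are hypothetical names, the "mean above a multiple of the median" rule is only one possible outlier definition, and nothing here uses the real Hadoop metrics APIs.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical per-volume latency tracker; not an existing Hadoop class. */
final class PerDiskIoStats {
    private static final class Stat {
        long count;
        long totalMs;
    }

    private final Map<String, Stat> byVolume = new TreeMap<>();

    /** Records one IO latency sample for the given volume. */
    synchronized void record(String volume, long elapsedMs) {
        Stat s = byVolume.computeIfAbsent(volume, v -> new Stat());
        s.count++;
        s.totalMs += elapsedMs;
    }

    /** Mean latency in milliseconds, or 0.0 if the volume has no samples. */
    synchronized double meanMs(String volume) {
        Stat s = byVolume.get(volume);
        return (s == null || s.count == 0) ? 0.0 : (double) s.totalMs / s.count;
    }

    /** Volumes whose mean exceeds factor times the (upper) median of all means. */
    synchronized List<String> outliers(double factor) {
        List<Double> means = new ArrayList<>();
        for (String v : byVolume.keySet()) {
            means.add(meanMs(v));
        }
        List<String> result = new ArrayList<>();
        if (means.isEmpty()) {
            return result;
        }
        Collections.sort(means);
        double median = means.get(means.size() / 2);
        for (String v : byVolume.keySet()) {
            if (meanMs(v) > factor * median) {
                result.add(v);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        PerDiskIoStats stats = new PerDiskIoStats();
        stats.record("/data/1/dfs/dn", 10);
        stats.record("/data/2/dfs/dn", 11);
        stats.record("/data/3/dfs/dn", 61_300);
        System.out.println("Outlier volumes: " + stats.outliers(5.0));
    }
}
```

Comparing each volume against the median rather than a fixed threshold keeps a uniformly busy node from flagging every disk at once.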

 DN should log warnings for lengthy disk IOs
 ---

 Key: HDFS-3751
 URL: https://issues.apache.org/jira/browse/HDFS-3751
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: data-node
Affects Versions: 1.2.0, 2.1.0-alpha
Reporter: Todd Lipcon
Assignee: Colin Patrick McCabe





[jira] [Commented] (HDFS-3751) DN should log warnings for lengthy disk IOs

2012-08-02 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13427421#comment-13427421
 ] 

Todd Lipcon commented on HDFS-3751:
---

Hey Bobby. We recently added metrics for these timings (HDFS-3170) and now 
calculate quantiles for them as well (HDFS-3650). I agree it would be nice to 
track them dynamically per mount, but I think that's a bit more complicated 
than the simple warning proposed here.

We used a hacked-up version of this proposed patch on a customer workload, and 
even that very simple logging was super helpful. Most people already have a 
way of grepping logs for certain key warning messages to trigger alerts, so 
even without Hadoop-side support for aggregating and counting the metrics, I 
think this should go in. Then let's file a separate JIRA to collect per-disk 
metrics using the metrics2 dynamic metrics support.
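
The log-grepping approach can be sketched as follows, assuming the WARN line takes the form suggested in the issue description. The message format is not settled in this thread, and `SlowIoLogScanner` and the sample log prefixes are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical log scanner; assumes the message shape from the issue text. */
final class SlowIoLogScanner {
    // Group 1 captures the volume path from a slow-IO WARN line.
    private static final Pattern SLOW_IO = Pattern.compile(
        "WARN.*IO of \\S+ to volume (\\S+) for block \\d+ client \\S+ "
        + "took [0-9.]+ seconds");

    private SlowIoLogScanner() {}

    /** Counts slow-IO warnings per volume across the given log lines. */
    static Map<String, Integer> countByVolume(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            Matcher m = SLOW_IO.matcher(line);
            if (m.find()) {
                counts.merge(m.group(1), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Invented sample lines; real DataNode log prefixes may differ.
        List<String> lines = Arrays.asList(
            "WARN DataNode: IO of 64kb to volume /data/1/dfs/dn for block "
                + "12345234 client 1.2.3.4 took 61.3 seconds",
            "INFO DataNode: Receiving block 12345235");
        System.out.println(countByVolume(lines));
    }
}
```

An alerting script could fire once a volume's count crosses some limit within a time window, which is the kind of trigger most log-monitoring setups already support.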





[jira] [Commented] (HDFS-3751) DN should log warnings for lengthy disk IOs

2012-08-02 Thread Andrew Purtell (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13427437#comment-13427437
 ] 

Andrew Purtell commented on HDFS-3751:
--

+1 Sounds like a simple change that would be really helpful.

bq. Most people already have a way of grepping logs for certain key warning 
messages to trigger alerts,

Often that is Splunk, but with HADOOP-7705 one could also use ElasticSearch, etc.





[jira] [Commented] (HDFS-3751) DN should log warnings for lengthy disk IOs

2012-08-01 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13426937#comment-13426937
 ] 

Colin Patrick McCabe commented on HDFS-3751:


Seems like a very good idea. Another good use for the monotonic timer.
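
For illustration, the elapsed time of an IO can be taken from a monotonic source so that wall-clock adjustments do not distort it. In plain Java, `System.nanoTime()` serves that purpose. The `MonotonicStopwatch` class below is hypothetical, and its injectable clock exists only to make the sketch testable.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

/** Hypothetical stopwatch over a monotonic nanosecond clock. */
final class MonotonicStopwatch {
    private final LongSupplier nanoClock;
    private final long startNanos;

    /** Starts timing immediately using the supplied nanosecond clock. */
    MonotonicStopwatch(LongSupplier nanoClock) {
        this.nanoClock = nanoClock;
        this.startNanos = nanoClock.getAsLong();
    }

    /** Starts a stopwatch backed by System.nanoTime(). */
    static MonotonicStopwatch start() {
        return new MonotonicStopwatch(System::nanoTime);
    }

    /** Milliseconds elapsed since construction. */
    long elapsedMillis() {
        return TimeUnit.NANOSECONDS.toMillis(nanoClock.getAsLong() - startNanos);
    }

    public static void main(String[] args) throws InterruptedException {
        MonotonicStopwatch sw = MonotonicStopwatch.start();
        Thread.sleep(50); // stand-in for a disk IO
        System.out.println("IO took " + sw.elapsedMillis() + " ms");
    }
}
```

Using `System.currentTimeMillis()` instead would tie the measurement to the wall clock, so a clock step during the IO could produce a bogus or even negative duration.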

 DN should log warnings for lengthy disk IOs
 ---

 Key: HDFS-3751
 URL: https://issues.apache.org/jira/browse/HDFS-3751
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: data-node
Affects Versions: 1.2.0, 2.1.0-alpha
Reporter: Todd Lipcon
Assignee: Todd Lipcon
