[ https://issues.apache.org/jira/browse/TS-4834?focusedWorklogId=28581&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-28581 ]

ASF GitHub Bot logged work on TS-4834:
--------------------------------------

                Author: ASF GitHub Bot
            Created on: 09/Sep/16 15:04
            Start Date: 09/Sep/16 15:04
    Worklog Time Spent: 10m 
      Work Description: Github user jpeach commented on the issue:

    https://github.com/apache/trafficserver/pull/996
  
    My comments from [TS-4834](https://issues.apache.org/jira/browse/TS-4834) ..
    
    > I think the nomenclature should mirror the existing metrics, which
    > typically don't include "count" in the names, especially since the count
    > of bad disks (a gauge) does not have the same semantics as the count of
    > disk errors (a counter).
    
    Looking at the code, I think the bad disk metric is a counter, not a gauge (see the monitoring sketch below for why the distinction matters).
    
    > Is "disk" the right nomenclature, or would "span" be more accurate?
    > How feasible is it to have separate metrics for read and write errors?
    > Does it make sense to also have metrics for disk read and write
    > operations, or are system tools sufficient for that?
    > If you have a count of bad disks, does it make sense to have other disk
    > counts as well, i.e. good disks, offline disks, etc.?
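
    To make the distinction concrete, here is a minimal monitoring sketch. It
    assumes the metric names from this pull request and that {{traffic_ctl
    metric get}} prints the same name/value pairs as {{metric match}}; the
    60-second window is arbitrary.

    {code}
    # A counter is monotonic, so a monitor derives a rate from two samples:
    prev=$(traffic_ctl metric get proxy.process.cache.disk_error_count | awk '{print $2}')
    sleep 60
    curr=$(traffic_ctl metric get proxy.process.cache.disk_error_count | awk '{print $2}')
    echo "cache disk errors per minute: $((curr - prev))"

    # A gauge would be read directly as current state and could decrease;
    # a bad-disk metric that only ever increments behaves as a counter.
    traffic_ctl metric get proxy.process.cache.disk_bad_count
    {code}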


Issue Time Tracking
-------------------

    Worklog Id:     (was: 28581)
    Time Spent: 40m  (was: 0.5h)

> Expose bad disk and disk access failures
> ----------------------------------------
>
>                 Key: TS-4834
>                 URL: https://issues.apache.org/jira/browse/TS-4834
>             Project: Traffic Server
>          Issue Type: Improvement
>          Components: Cache, Metrics
>            Reporter: Gancho Tenev
>            Assignee: Gancho Tenev
>             Fix For: 7.0.0
>
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> We would like to monitor low-level disk access failures and disks marked by 
> ATS as bad.
> I have a patch that exposes that information through
> {code}
> proxy.process.cache.disk_error_count 10
> proxy.process.cache.disk_bad_count 5
> {code}
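> A quick way to keep an eye on both new metrics while reproducing the
> scenario below (a sketch, assuming the same {{traffic_ctl metric match}}
> regex syntax used in this ticket):
> {code}
> $ watch -n1 './bin/traffic_ctl metric match "proxy.process.cache.disk_(error|bad)_count"'
> {code}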
> and the following test shows how it would work.
> Start ATS with 2 disks and tail {{diags.log}}
> {code}
> $ cat etc/trafficserver/storage.config
> /dev/sdb
> /dev/sdc
> $ tail -f var/log/trafficserver/diags.log
> [Sep  8 12:18:48.149] Server {0x2b5f43db54c0} NOTE: traffic server running
> [Sep  8 12:18:48.198] Server {0x2b5f44654700} NOTE: cache enabled
> {code}
> Check the related metrics and observe that they are all 0:
> {code}
> $ ./bin/traffic_ctl metric match "proxy.process.cache*.disk.*" \
>     "proxy.process.cache.*(read|write).failure" \
>     "proxy.process.http.cache_(read|write)_errors"
> proxy.process.cache.disk_error_count 0
> proxy.process.cache.disk_bad_count 0
> proxy.process.cache.read.failure 0
> proxy.process.cache.write.failure 0
> proxy.process.cache.volume_0.read.failure 0
> proxy.process.cache.volume_0.write.failure 0
> proxy.process.http.cache_write_errors 0
> proxy.process.http.cache_read_errors 0
> {code}
> Now, using your favorite disk-failure injection tool, set both disks used by
> this setup ({{/dev/sdb}} and {{/dev/sdc}}) to fail all reads, then send 5
> requests, causing 10 failed disk reads.
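> One way to do the injection is a device-mapper {{error}} target standing in
> for each disk. This is a hedged sketch, not necessarily the tool used for
> the numbers below; {{sdb_err}} is a hypothetical mapping name:
> {code}
> $ sudo dmsetup create sdb_err --table "0 $(sudo blockdev --getsz /dev/sdb) error"
> $ # ...then point storage.config at /dev/mapper/sdb_err instead of /dev/sdb
> {code}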
> {code}
> $ for i in 1 2 3 4 5; do curl -x 127.0.0.1:80 http://example.com/1 -o /dev/null -s; done
> $ tail -f var/log/trafficserver/diags.log
> [Sep  8 12:19:09.758] Server {0x2aaab4302700} WARNING: cache disk operation failed READ -1 0
> [Sep  8 12:19:09.759] Server {0x2aaac0100700} WARNING: cache disk operation failed READ -1 0
> [Sep  8 12:19:09.764] Server {0x2b5f43db54c0} WARNING: Error accessing Disk /dev/sdb [1/10]
> [Sep  8 12:19:09.769] Server {0x2b5f44654700} WARNING: Error accessing Disk /dev/sdb [2/10]
> [Sep  8 12:19:09.785] Server {0x2aaac0100700} WARNING: cache disk operation failed READ -1 0
> [Sep  8 12:19:09.786] Server {0x2aaab4302700} WARNING: cache disk operation failed READ -1 0
> [Sep  8 12:19:09.791] Server {0x2b5f44654700} WARNING: Error accessing Disk /dev/sdb [3/10]
> [Sep  8 12:19:09.796] Server {0x2b5f43db54c0} WARNING: Error accessing Disk /dev/sdb [4/10]
> [Sep  8 12:19:09.812] Server {0x2aaab4100700} WARNING: cache disk operation failed READ -1 0
> [Sep  8 12:19:09.813] Server {0x2aaacc100700} WARNING: cache disk operation failed READ -1 0
> [Sep  8 12:19:09.817] Server {0x2b5f43db54c0} WARNING: Error accessing Disk /dev/sdb [5/10]
> [Sep  8 12:19:09.823] Server {0x2b5f44654700} WARNING: Error accessing Disk /dev/sdb [6/10]
> [Sep  8 12:19:09.843] Server {0x2aaacc302700} WARNING: cache disk operation failed READ -1 0
> [Sep  8 12:19:09.844] Server {0x2aaad8100700} WARNING: cache disk operation failed READ -1 0
> [Sep  8 12:19:09.847] Server {0x2b5f44654700} WARNING: Error accessing Disk /dev/sdb [7/10]
> [Sep  8 12:19:09.854] Server {0x2b5f43db54c0} WARNING: Error accessing Disk /dev/sdb [8/10]
> [Sep  8 12:19:09.874] Server {0x2aaacc302700} WARNING: cache disk operation failed READ -1 0
> [Sep  8 12:19:09.875] Server {0x2aaad8100700} WARNING: cache disk operation failed READ -1 0
> [Sep  8 12:19:09.880] Server {0x2b5f43db54c0} WARNING: Error accessing Disk /dev/sdb [9/10]
> [Sep  8 12:19:09.887] Server {0x2b5f44654700} WARNING: too many errors accessing disk /dev/sdb [10/10]: declaring disk bad
> {code}
> We see 5 cache read failures, each of which triggered 2 actual disk reads
> (10 disk errors total), hitting the [10/10] threshold and marking the
> failing disk ({{/dev/sdb}}) bad.
> {code}
> $ ./bin/traffic_ctl metric match "proxy.process.cache*.disk.*" \
>     "proxy.process.cache.*(read|write).failure" \
>     "proxy.process.http.cache_(read|write)_errors"
> proxy.process.cache.disk_error_count 10
> proxy.process.cache.disk_bad_count 1
> proxy.process.cache.read.failure 5
> proxy.process.cache.write.failure 5
> proxy.process.cache.volume_0.read.failure 5
> proxy.process.cache.volume_0.write.failure 5
> proxy.process.http.cache_write_errors 0
> proxy.process.http.cache_read_errors 0
> {code}
> Now send 5 more requests, causing 10 more failed reads, this time against
> the remaining disk {{/dev/sdc}}.
> {code}
> $ for i in 1 2 3 4 5; do curl -x 127.0.0.1:80 http://example.com/1 -o /dev/null -s; done
> $ tail -f var/log/trafficserver/diags.log
> [Sep  8 12:26:02.874] Server {0x2aaae4100700} WARNING: cache disk operation failed READ -1 0
> [Sep  8 12:26:02.875] Server {0x2aaaf0302700} WARNING: cache disk operation failed READ -1 0
> [Sep  8 12:26:02.876] Server {0x2b5f44654700} WARNING: Error accessing Disk /dev/sdc [1/10]
> [Sep  8 12:26:02.885] Server {0x2b5f43db54c0} WARNING: Error accessing Disk /dev/sdc [2/10]
> [Sep  8 12:26:02.902] Server {0x2aaaf0302700} WARNING: cache disk operation failed READ -1 0
> [Sep  8 12:26:02.902] Server {0x2aaae4100700} WARNING: cache disk operation failed READ -1 0
> [Sep  8 12:26:02.907] Server {0x2b5f43db54c0} WARNING: Error accessing Disk /dev/sdc [3/10]
> [Sep  8 12:26:02.914] Server {0x2b5f44654700} WARNING: Error accessing Disk /dev/sdc [4/10]
> [Sep  8 12:26:02.936] Server {0x2aaafc100700} WARNING: cache disk operation failed READ -1 0
> [Sep  8 12:26:02.936] Server {0x2aab08100700} WARNING: cache disk operation failed READ -1 0
> [Sep  8 12:26:02.940] Server {0x2b5f44654700} WARNING: Error accessing Disk /dev/sdc [5/10]
> [Sep  8 12:26:02.946] Server {0x2b5f43db54c0} WARNING: Error accessing Disk /dev/sdc [6/10]
> [Sep  8 12:26:02.971] Server {0x2aab08100700} WARNING: cache disk operation failed READ -1 0
> [Sep  8 12:26:02.971] Server {0x2aaafc100700} WARNING: cache disk operation failed READ -1 0
> [Sep  8 12:26:02.976] Server {0x2b5f43db54c0} WARNING: Error accessing Disk /dev/sdc [7/10]
> [Sep  8 12:26:02.982] Server {0x2b5f44654700} WARNING: Error accessing Disk /dev/sdc [8/10]
> [Sep  8 12:26:03.002] Server {0x2aaafc100700} WARNING: cache disk operation failed READ -1 0
> [Sep  8 12:26:03.002] Server {0x2aab08100700} WARNING: cache disk operation failed READ -1 0
> [Sep  8 12:26:03.006] Server {0x2b5f44654700} WARNING: Error accessing Disk /dev/sdc [9/10]
> [Sep  8 12:26:03.012] Server {0x2b5f43db54c0} WARNING: too many errors accessing disk /dev/sdc [10/10]: declaring disk bad
> {code}
> We see 5 more read failures, which triggered 10 more actual failed disk
> reads and marked the second disk ({{/dev/sdc}}) bad as well.
> {code}
> $ ./bin/traffic_ctl metric match "proxy.process.cache*.disk.*" \
>     "proxy.process.cache.*(read|write).failure" \
>     "proxy.process.http.cache_(read|write)_errors"
> proxy.process.cache.disk_error_count 20
> proxy.process.cache.disk_bad_count 2
> proxy.process.cache.read.failure 10
> proxy.process.cache.write.failure 10
> proxy.process.cache.volume_0.read.failure 10
> proxy.process.cache.volume_0.write.failure 10
> proxy.process.http.cache_write_errors 0
> proxy.process.http.cache_read_errors 0
> {code}
> After marking both disks bad, ATS finally disabled the cache and started
> going to origin on every request.
> {code}
> $ tail -f var/log/trafficserver/diags.log
> [Sep  8 12:26:03.012] Server {0x2b5f43db54c0} WARNING: All storage devices offline, cache disabled
> {code}
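> Since a growing {{disk_bad_count}} eventually means a disabled cache, these
> metrics lend themselves to alerting. A minimal sketch, assuming
> {{traffic_ctl metric get}} prints the same name/value pairs as {{metric
> match}} above, with a hypothetical {{oncall@example.com}} recipient:
> {code}
> $ bad=$(./bin/traffic_ctl metric get proxy.process.cache.disk_bad_count | awk '{print $2}')
> $ [ "$bad" -gt 0 ] && echo "$bad cache disk(s) marked bad" | mail -s "ATS cache disk failure" oncall@example.com
> {code}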


