[
https://issues.apache.org/jira/browse/TS-4834?focusedWorklogId=29276&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-29276
]
ASF GitHub Bot logged work on TS-4834:
--------------------------------------
Author: ASF GitHub Bot
Created on: 17/Sep/16 18:44
Start Date: 17/Sep/16 18:44
Worklog Time Spent: 10m
Work Description: Github user gtenev commented on the issue:
https://github.com/apache/trafficserver/pull/996
@jpeach and @SolidWallOfCode, really appreciate your feedback! Here is the
new patch.
Removed `_count`. The reason I added it was that internally both the "bad
disk" metric and the "disk errors" metric had counter semantics, and there are
already metrics containing `_count`.
Renamed `disk` to `span`; this indeed seems more accurate.
I like the idea of exposing "online", "offline", etc. The new patch exposes
the following metrics as gauges:
```
proxy.process.cache.span.failing
proxy.process.cache.span.offline
proxy.process.cache.span.online
```
A span "moves" from the "online" bucket (`errors == 0`) to "failing"
(`errors > 0 && errors < proxy.config.cache.max_disk_errors`) to "offline"
(`errors >= proxy.config.cache.max_disk_errors`).
Please note that "failing" + "offline" + "online" = total number of spans.
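For illustration only, here is a sketch of the bucket assignment and the invariant described above. All names are hypothetical, not the actual Traffic Server code, and `MAX_DISK_ERRORS` stands in for `proxy.config.cache.max_disk_errors`:

```python
# Hypothetical sketch of the gauge buckets; illustrative only,
# not the actual Traffic Server implementation.
MAX_DISK_ERRORS = 10  # stands in for proxy.config.cache.max_disk_errors


def span_bucket(errors):
    """Map a span's accumulated error count to its gauge bucket."""
    if errors == 0:
        return "online"
    if errors < MAX_DISK_ERRORS:
        return "failing"
    return "offline"


def span_gauges(error_counts):
    """Compute the three gauges from a list of per-span error counts."""
    gauges = {"online": 0, "failing": 0, "offline": 0}
    for errors in error_counts:
        gauges[span_bucket(errors)] += 1
    return gauges


# The three gauges always sum to the total number of spans.
g = span_gauges([0, 3, 10, 0])
assert sum(g.values()) == 4
```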
It was possible to split the read and write metrics, so I removed
`proxy.process.cache.disk.errors` in favor of
```
proxy.process.cache.span.errors.read
proxy.process.cache.span.errors.write
```
Removed the "per-volume" stats. @SolidWallOfCode, incrementing the metrics
for all volumes on each failure does not make sense either. I would need to add
more code to increment the metrics for the right volume, and I am not sure it
is worth the effort (and maintenance).
I think the general idea of this change is to give the operator a signal
that the disks need to be inspected; the operator would then diagnose them
with more sophisticated, lower-level disk utilities.
Noticed that the metrics would not get updated properly (would become
somewhat inconsistent) if there were failures during cache initialization
(before "cache enabled"), and tried to fix this wherever I noticed it and was
able to test/reproduce.
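As a minimal sketch of the counter side (hypothetical names, not the actual ATS code, assuming the default `proxy.config.cache.max_disk_errors` of 10 as in the walkthrough below): each failed I/O bumps the matching `errors.read`/`errors.write` counter and the span's own error count, and at the threshold the span is declared bad.

```python
# Hypothetical sketch (not the actual ATS code) of the counter semantics.
MAX_DISK_ERRORS = 10  # stands in for proxy.config.cache.max_disk_errors


class Span:
    def __init__(self, path):
        self.path = path
        self.errors = 0
        self.bad = False


metrics = {"span.errors.read": 0, "span.errors.write": 0}


def record_io_failure(span, op):
    """Record one failed disk operation ('read' or 'write') on a span."""
    metrics[f"span.errors.{op}"] += 1
    span.errors += 1
    if not span.bad and span.errors >= MAX_DISK_ERRORS:
        span.bad = True  # "too many errors ... declaring disk bad"


sdb = Span("/dev/sdb")
for _ in range(10):
    record_io_failure(sdb, "read")
assert sdb.bad
assert metrics["span.errors.read"] == 10
```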
Issue Time Tracking
-------------------
Worklog Id: (was: 29276)
Time Spent: 1.5h (was: 1h 20m)
> Expose bad disk and disk access failures
> ----------------------------------------
>
> Key: TS-4834
> URL: https://issues.apache.org/jira/browse/TS-4834
> Project: Traffic Server
> Issue Type: Improvement
> Components: Cache, Metrics
> Reporter: Gancho Tenev
> Assignee: Gancho Tenev
> Fix For: 7.1.0
>
> Time Spent: 1.5h
> Remaining Estimate: 0h
>
> We would like to monitor low-level disk access failures and disks marked by
> ATS as bad.
> I have a patch that exposes that information through
> {code}
> proxy.process.cache.disk_error_count 10
> proxy.process.cache.disk_bad_count 5
> {code}
> and the following test shows how it would work.
> Start ATS with 2 disks and tail {{diags.log}}
> {code}
> $ cat etc/trafficserver/storage.config
> /dev/sdb
> /dev/sdc
> $ tail -f var/log/trafficserver/diags.log
> [Sep 8 12:18:48.149] Server {0x2b5f43db54c0} NOTE: traffic server running
> [Sep 8 12:18:48.198] Server {0x2b5f44654700} NOTE: cache enabled
> {code}
> Check related metrics and observe all 0s
> {code}
> $ ./bin/traffic_ctl metric match "proxy.process.cache*.disk.*"
> "proxy.process.cache.*(read|write).failure"
> "proxy.process.http.cache_(read|write)_errors"
> proxy.process.cache.disk_error_count 0
> proxy.process.cache.disk_bad_count 0
> proxy.process.cache.read.failure 0
> proxy.process.cache.write.failure 0
> proxy.process.cache.volume_0.read.failure 0
> proxy.process.cache.volume_0.write.failure 0
> proxy.process.http.cache_write_errors 0
> proxy.process.http.cache_read_errors 0
> {code}
> Now, using your favorite hard disk failure injection tool, inject 10 failures
> by setting both disks used by this setup, {{/dev/sdb}} and {{/dev/sdc}}, to
> fail all reads, and shoot 5 requests causing 10 failed reads.
> {code}
> $ for i in 1 2 3 4 5; do curl -x 127.0.0.1:80 http://example.com/1 -o
> /dev/null -s; done
> $ tail -f var/log/trafficserver/diags.log
> [Sep 8 12:19:09.758] Server {0x2aaab4302700} WARNING: cache disk operation
> failed READ -1 0
> [Sep 8 12:19:09.759] Server {0x2aaac0100700} WARNING: cache disk operation
> failed READ -1 0
> [Sep 8 12:19:09.764] Server {0x2b5f43db54c0} WARNING: Error accessing Disk
> /dev/sdb [1/10]
> [Sep 8 12:19:09.769] Server {0x2b5f44654700} WARNING: Error accessing Disk
> /dev/sdb [2/10]
> [Sep 8 12:19:09.785] Server {0x2aaac0100700} WARNING: cache disk operation
> failed READ -1 0
> [Sep 8 12:19:09.786] Server {0x2aaab4302700} WARNING: cache disk operation
> failed READ -1 0
> [Sep 8 12:19:09.791] Server {0x2b5f44654700} WARNING: Error accessing Disk
> /dev/sdb [3/10]
> [Sep 8 12:19:09.796] Server {0x2b5f43db54c0} WARNING: Error accessing Disk
> /dev/sdb [4/10]
> [Sep 8 12:19:09.812] Server {0x2aaab4100700} WARNING: cache disk operation
> failed READ -1 0
> [Sep 8 12:19:09.813] Server {0x2aaacc100700} WARNING: cache disk operation
> failed READ -1 0
> [Sep 8 12:19:09.817] Server {0x2b5f43db54c0} WARNING: Error accessing Disk
> /dev/sdb [5/10]
> [Sep 8 12:19:09.823] Server {0x2b5f44654700} WARNING: Error accessing Disk
> /dev/sdb [6/10]
> [Sep 8 12:19:09.843] Server {0x2aaacc302700} WARNING: cache disk operation
> failed READ -1 0
> [Sep 8 12:19:09.844] Server {0x2aaad8100700} WARNING: cache disk operation
> failed READ -1 0
> [Sep 8 12:19:09.847] Server {0x2b5f44654700} WARNING: Error accessing Disk
> /dev/sdb [7/10]
> [Sep 8 12:19:09.854] Server {0x2b5f43db54c0} WARNING: Error accessing Disk
> /dev/sdb [8/10]
> [Sep 8 12:19:09.874] Server {0x2aaacc302700} WARNING: cache disk operation
> failed READ -1 0
> [Sep 8 12:19:09.875] Server {0x2aaad8100700} WARNING: cache disk operation
> failed READ -1 0
> [Sep 8 12:19:09.880] Server {0x2b5f43db54c0} WARNING: Error accessing Disk
> /dev/sdb [9/10]
> [Sep 8 12:19:09.887] Server {0x2b5f44654700} WARNING: too many errors
> accessing disk /dev/sdb [10/10]: declaring disk bad
> {code}
> We see 5 read failures, which triggered 10 actual failed disk reads and
> marked the failing disk as bad.
> {code}
> $ ./bin/traffic_ctl metric match "proxy.process.cache*.disk.*"
> "proxy.process.cache.*(read|write).failure"
> "proxy.process.http.cache_(read|write)_errors"
> proxy.process.cache.disk_error_count 10
> proxy.process.cache.disk_bad_count 1
> proxy.process.cache.read.failure 5
> proxy.process.cache.write.failure 5
> proxy.process.cache.volume_0.read.failure 5
> proxy.process.cache.volume_0.write.failure 5
> proxy.process.http.cache_write_errors 0
> proxy.process.http.cache_read_errors 0
> {code}
> Now shoot 5 requests causing 10 failed reads.
> {code}
> $ for i in 1 2 3 4 5; do curl -x 127.0.0.1:80 http://example.com/1 -o
> /dev/null -s; done
> $ tail -f var/log/trafficserver/diags.log
> [Sep 8 12:26:02.874] Server {0x2aaae4100700} WARNING: cache disk operation
> failed READ -1 0
> [Sep 8 12:26:02.875] Server {0x2aaaf0302700} WARNING: cache disk operation
> failed READ -1 0
> [Sep 8 12:26:02.876] Server {0x2b5f44654700} WARNING: Error accessing Disk
> /dev/sdc [1/10]
> [Sep 8 12:26:02.885] Server {0x2b5f43db54c0} WARNING: Error accessing Disk
> /dev/sdc [2/10]
> [Sep 8 12:26:02.902] Server {0x2aaaf0302700} WARNING: cache disk operation
> failed READ -1 0
> [Sep 8 12:26:02.902] Server {0x2aaae4100700} WARNING: cache disk operation
> failed READ -1 0
> [Sep 8 12:26:02.907] Server {0x2b5f43db54c0} WARNING: Error accessing Disk
> /dev/sdc [3/10]
> [Sep 8 12:26:02.914] Server {0x2b5f44654700} WARNING: Error accessing Disk
> /dev/sdc [4/10]
> [Sep 8 12:26:02.936] Server {0x2aaafc100700} WARNING: cache disk operation
> failed READ -1 0
> [Sep 8 12:26:02.936] Server {0x2aab08100700} WARNING: cache disk operation
> failed READ -1 0
> [Sep 8 12:26:02.940] Server {0x2b5f44654700} WARNING: Error accessing Disk
> /dev/sdc [5/10]
> [Sep 8 12:26:02.946] Server {0x2b5f43db54c0} WARNING: Error accessing Disk
> /dev/sdc [6/10]
> [Sep 8 12:26:02.971] Server {0x2aab08100700} WARNING: cache disk operation
> failed READ -1 0
> [Sep 8 12:26:02.971] Server {0x2aaafc100700} WARNING: cache disk operation
> failed READ -1 0
> [Sep 8 12:26:02.976] Server {0x2b5f43db54c0} WARNING: Error accessing Disk
> /dev/sdc [7/10]
> [Sep 8 12:26:02.982] Server {0x2b5f44654700} WARNING: Error accessing Disk
> /dev/sdc [8/10]
> [Sep 8 12:26:03.002] Server {0x2aaafc100700} WARNING: cache disk operation
> failed READ -1 0
> [Sep 8 12:26:03.002] Server {0x2aab08100700} WARNING: cache disk operation
> failed READ -1 0
> [Sep 8 12:26:03.006] Server {0x2b5f44654700} WARNING: Error accessing Disk
> /dev/sdc [9/10]
> [Sep 8 12:26:03.012] Server {0x2b5f43db54c0} WARNING: too many errors
> accessing disk /dev/sdc [10/10]: declaring disk bad
> {code}
> We see 5 more read failures, which triggered 10 more actual failed disk
> reads and marked the second disk as bad.
> {code}
> $ ./bin/traffic_ctl metric match "proxy.process.cache*.disk.*"
> "proxy.process.cache.*(read|write).failure"
> "proxy.process.http.cache_(read|write)_errors"
> proxy.process.cache.disk_error_count 20
> proxy.process.cache.disk_bad_count 2
> proxy.process.cache.read.failure 10
> proxy.process.cache.write.failure 10
> proxy.process.cache.volume_0.read.failure 10
> proxy.process.cache.volume_0.write.failure 10
> proxy.process.http.cache_write_errors 0
> proxy.process.http.cache_read_errors 0
> {code}
> After marking all (2) disks as bad, ATS finally disabled the cache and
> started going to origin on every request.
> {code}
> $ tail -f var/log/trafficserver/diags.log
> [Sep 8 12:26:03.012] Server {0x2b5f43db54c0} WARNING: All storage devices
> offline, cache disabled
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)