[
https://issues.apache.org/jira/browse/TS-4834?focusedWorklogId=29276&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-29276
]
ASF GitHub Bot logged work on TS-4834:
--------------------------------------
Author: ASF GitHub Bot
Created on: 17/Sep/16 18:44
Start Date: 17/Sep/16 18:44
Worklog Time Spent: 10m
Work Description: Github user gtenev commented on the issue:
https://github.com/apache/trafficserver/pull/996
@jpeach and @SolidWallOfCode, really appreciate your feedback! Here is the
new patch.
Removed `_count`. The reason I added it was that internally both the "bad
disk" metric and the "disk errors" metric had counter semantics, and there are
already metrics containing `_count`.
Renamed `disk` to `span`; this indeed seems more accurate.
I like the idea of exposing "online", "offline", etc. The new patch exposes
the following metrics as gauges:
```
proxy.process.cache.span.failing
proxy.process.cache.span.offline
proxy.process.cache.span.online
```
A span "moves" from the "online" bucket (`errors == 0`) to "failing"
(`errors > 0 && errors < proxy.config.cache.max_disk_errors`) to "offline"
(`errors >= proxy.config.cache.max_disk_errors`).
Please note that "failing" + "offline" + "online" = total number of spans.
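For illustration only, here is a sketch of the bucket assignment and the invariant described above. All names are hypothetical, not the actual Traffic Server code, and `MAX_DISK_ERRORS` stands in for `proxy.config.cache.max_disk_errors`:

```python
# Hypothetical sketch of the gauge buckets; illustrative only,
# not the actual Traffic Server implementation.
MAX_DISK_ERRORS = 10  # stands in for proxy.config.cache.max_disk_errors


def span_bucket(errors):
    """Map a span's accumulated error count to its gauge bucket."""
    if errors == 0:
        return "online"
    if errors < MAX_DISK_ERRORS:
        return "failing"
    return "offline"


def span_gauges(error_counts):
    """Compute the three gauges from a list of per-span error counts."""
    gauges = {"online": 0, "failing": 0, "offline": 0}
    for errors in error_counts:
        gauges[span_bucket(errors)] += 1
    return gauges


# The three gauges always sum to the total number of spans.
g = span_gauges([0, 3, 10, 0])
assert sum(g.values()) == 4
```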
It was possible to split the read and write metrics, so I removed
`proxy.process.cache.disk.errors` in favor of
```
proxy.process.cache.span.errors.read
proxy.process.cache.span.errors.write
```
Removed the "per-volume" stats. @SolidWallOfCode, incrementing the metrics
for all volumes on each failure does not make sense either. I would need to add
more code to increment the metrics for the right volume, and I am not sure it
is worth the effort (and maintenance).
I think the general idea of this change is to give the operator a signal
that the disks need to be inspected; the operator would then diagnose them
with more sophisticated, lower-level disk utilities.
Noticed that the metrics would not get updated properly (would become
somewhat inconsistent) if there were failures during cache initialization
(before "cache enabled"), and tried to fix this wherever I noticed it and was
able to test/reproduce.
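As a minimal sketch of the counter side (hypothetical names, not the actual ATS code, assuming the default `proxy.config.cache.max_disk_errors` of 10 as in the walkthrough below): each failed I/O bumps the matching `errors.read`/`errors.write` counter and the span's own error count, and at the threshold the span is declared bad.

```python
# Hypothetical sketch (not the actual ATS code) of the counter semantics.
MAX_DISK_ERRORS = 10  # stands in for proxy.config.cache.max_disk_errors


class Span:
    def __init__(self, path):
        self.path = path
        self.errors = 0
        self.bad = False


metrics = {"span.errors.read": 0, "span.errors.write": 0}


def record_io_failure(span, op):
    """Record one failed disk operation ('read' or 'write') on a span."""
    metrics[f"span.errors.{op}"] += 1
    span.errors += 1
    if not span.bad and span.errors >= MAX_DISK_ERRORS:
        span.bad = True  # "too many errors ... declaring disk bad"


sdb = Span("/dev/sdb")
for _ in range(10):
    record_io_failure(sdb, "read")
assert sdb.bad
assert metrics["span.errors.read"] == 10
```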
Issue Time Tracking
-------------------
Worklog Id: (was: 29276)
Time Spent: 1.5h (was: 1h 20m)
> Expose bad disk and disk access failures
> ----------------------------------------
>
> Key: TS-4834
> URL: https://issues.apache.org/jira/browse/TS-4834
> Project: Traffic Server
> Issue Type: Improvement
> Components: Cache, Metrics
> Reporter: Gancho Tenev
> Assignee: Gancho Tenev
> Fix For: 7.1.0
>
> Time Spent: 1.5h
> Remaining Estimate: 0h
>
> We would like to monitor low-level disk access failures and disks marked by
> ATS as bad.
> I have a patch that exposes that information through
> {code}
> proxy.process.cache.disk_error_count 10
> proxy.process.cache.disk_bad_count 5
> {code}
> and the following test shows how it would work.
> Start ATS with 2 disks and tail {{diags.log}}
> {code}
> $ cat etc/trafficserver/storage.config
> /dev/sdb
> /dev/sdc
> $ tail -f var/log/trafficserver/diags.log
> [Sep 8 12:18:48.149] Server {0x2b5f43db54c0} NOTE: traffic server running
> [Sep 8 12:18:48.198] Server {0x2b5f44654700} NOTE: cache enabled
> {code}
> Check related metrics and observe all 0s
> {code}
> $ ./bin/traffic_ctl metric match "proxy.process.cache*.disk.*"
> "proxy.process.cache.*(read|write).failure"
> "proxy.process.http.cache_(read|write)_errors"
> proxy.process.cache.disk_error_count 0
> proxy.process.cache.disk_bad_count 0
> proxy.process.cache.read.failure 0
> proxy.process.cache.write.failure 0
> proxy.process.cache.volume_0.read.failure 0
> proxy.process.cache.volume_0.write.failure 0
> proxy.process.http.cache_write_errors 0
> proxy.process.http.cache_read_errors 0
> {code}
> Now, using your favorite hard disk failure injection tool, inject 10 failures
> by setting both disks used by this setup, {{/dev/sdb}} and {{/dev/sdc}}, to
> fail all reads, and shoot 5 requests causing 10 failed reads.
> {code}
> $ for i in 1 2 3 4 5; do curl -x 127.0.0.1:80 http://example.com/1 -o
> /dev/null -s; done
> $ tail -f var/log/trafficserver/diags.log
> [Sep 8 12:19:09.758] Server {0x2aaab4302700} WARNING: cache disk operation
> failed READ -1 0
> [Sep 8 12:19:09.759] Server {0x2aaac0100700} WARNING: cache disk operation
> failed READ -1 0
> [Sep 8 12:19:09.764] Server {0x2b5f43db54c0} WARNING: Error accessing Disk
> /dev/sdb [1/10]
> [Sep 8 12:19:09.769] Server {0x2b5f44654700} WARNING: Error accessing Disk
> /dev/sdb [2/10]
> [Sep 8 12:19:09.785] Server {0x2aaac0100700} WARNING: cache disk operation
> failed READ -1 0
> [Sep 8 12:19:09.786] Server {0x2aaab4302700} WARNING: cache disk operation
> failed READ -1 0
> [Sep 8 12:19:09.791] Server {0x2b5f44654700} WARNING: Error accessing Disk
> /dev/sdb [3/10]
> [Sep 8 12:19:09.796] Server {0x2b5f43db54c0} WARNING: Error accessing Disk
> /dev/sdb [4/10]
> [Sep 8 12:19:09.812] Server {0x2aaab4100700} WARNING: cache disk operation
> failed READ -1 0
> [Sep 8 12:19:09.813] Server {0x2aaacc100700} WARNING: cache disk operation
> failed READ -1 0
> [Sep 8 12:19:09.817] Server {0x2b5f43db54c0} WARNING: Error accessing Disk
> /dev/sdb [5/10]
> [Sep 8 12:19:09.823] Server {0x2b5f44654700} WARNING: Error accessing Disk
> /dev/sdb [6/10]
> [Sep 8 12:19:09.843] Server {0x2aaacc302700} WARNING: cache disk operation
> failed READ -1 0
> [Sep 8 12:19:09.844] Server {0x2aaad8100700} WARNING: cache disk operation
> failed READ -1 0
> [Sep 8 12:19:09.847] Server {0x2b5f44654700} WARNING: Error accessing Disk
> /dev/sdb [7/10]
> [Sep 8 12:19:09.854] Server {0x2b5f43db54c0} WARNING: Error accessing Disk
> /dev/sdb [8/10]
> [Sep 8 12:19:09.874] Server {0x2aaacc302700} WARNING: cache disk operation
> failed READ -1 0
> [Sep 8 12:19:09.875] Server {0x2aaad8100700} WARNING: cache disk operation
> failed READ -1 0
> [Sep 8 12:19:09.880] Server {0x2b5f43db54c0} WARNING: Error accessing Disk
> /dev/sdb [9/10]
> [Sep 8 12:19:09.887] Server {0x2b5f44654700} WARNING: too many errors
> accessing disk /dev/sdb [10/10]: declaring disk bad
> {code}
> We see 5 read failures, which triggered 10 actual failed disk reads and
> marked the failing disk as bad.
> {code}
> $ ./bin/traffic_ctl metric match "proxy.process.cache*.disk.*"
> "proxy.process.cache.*(read|write).failure"
> "proxy.process.http.cache_(read|write)_errors"
> proxy.process.cache.disk_error_count 10
> proxy.process.cache.disk_bad_count 1
> proxy.process.cache.read.failure 5
> proxy.process.cache.write.failure 5
> proxy.process.cache.volume_0.read.failure 5
> proxy.process.cache.volume_0.write.failure 5
> proxy.process.http.cache_write_errors 0
> proxy.process.http.cache_read_errors 0
> {code}
> Now shoot 5 requests causing 10 failed reads.
> {code}
> $ for i in 1 2 3 4 5; do curl -x 127.0.0.1:80 http://example.com/1 -o
> /dev/null -s; done
> $ tail -f var/log/trafficserver/diags.log
> [Sep 8 12:26:02.874] Server {0x2aaae4100700} WARNING: cache disk operation
> failed READ -1 0
> [Sep 8 12:26:02.875] Server {0x2aaaf0302700} WARNING: cache disk operation
> failed READ -1 0
> [Sep 8 12:26:02.876] Server {0x2b5f44654700} WARNING: Error accessing Disk
> /dev/sdc [1/10]
> [Sep 8 12:26:02.885] Server {0x2b5f43db54c0} WARNING: Error accessing Disk
> /dev/sdc [2/10]
> [Sep 8 12:26:02.902] Server {0x2aaaf0302700} WARNING: cache disk operation
> failed READ -1 0
> [Sep 8 12:26:02.902] Server {0x2aaae4100700} WARNING: cache disk operation
> failed READ -1 0
> [Sep 8 12:26:02.907] Server {0x2b5f43db54c0} WARNING: Error accessing Disk
> /dev/sdc [3/10]
> [Sep 8 12:26:02.914] Server {0x2b5f44654700} WARNING: Error accessing Disk
> /dev/sdc [4/10]
> [Sep 8 12:26:02.936] Server {0x2aaafc100700} WARNING: cache disk operation
> failed READ -1 0
> [Sep 8 12:26:02.936] Server {0x2aab08100700} WARNING: cache disk operation
> failed READ -1 0
> [Sep 8 12:26:02.940] Server {0x2b5f44654700} WARNING: Error accessing Disk
> /dev/sdc [5/10]
> [Sep 8 12:26:02.946] Server {0x2b5f43db54c0} WARNING: Error accessing Disk
> /dev/sdc [6/10]
> [Sep 8 12:26:02.971] Server {0x2aab08100700} WARNING: cache disk operation
> failed READ -1 0
> [Sep 8 12:26:02.971] Server {0x2aaafc100700} WARNING: cache disk operation
> failed READ -1 0
> [Sep 8 12:26:02.976] Server {0x2b5f43db54c0} WARNING: Error accessing Disk
> /dev/sdc [7/10]
> [Sep 8 12:26:02.982] Server {0x2b5f44654700} WARNING: Error accessing Disk
> /dev/sdc [8/10]
> [Sep 8 12:26:03.002] Server {0x2aaafc100700} WARNING: cache disk operation
> failed READ -1 0
> [Sep 8 12:26:03.002] Server {0x2aab08100700} WARNING: cache disk operation
> failed READ -1 0
> [Sep 8 12:26:03.006] Server {0x2b5f44654700} WARNING: Error accessing Disk
> /dev/sdc [9/10]
> [Sep 8 12:26:03.012] Server {0x2b5f43db54c0} WARNING: too many errors
> accessing disk /dev/sdc [10/10]: declaring disk bad
> {code}
> We see 5 more read failures, which triggered 10 more actual failed disk
> reads and marked the second disk as bad.
> {code}
> $ ./bin/traffic_ctl metric match "proxy.process.cache*.disk.*"
> "proxy.process.cache.*(read|write).failure"
> "proxy.process.http.cache_(read|write)_errors"
> proxy.process.cache.disk_error_count 20
> proxy.process.cache.disk_bad_count 2
> proxy.process.cache.read.failure 10
> proxy.process.cache.write.failure 10
> proxy.process.cache.volume_0.read.failure 10
> proxy.process.cache.volume_0.write.failure 10
> proxy.process.http.cache_write_errors 0
> proxy.process.http.cache_read_errors 0
> {code}
> After marking all (2) disks as bad, ATS finally disabled the cache and
> started going to origin on every request.
> {code}
> $ tail -f var/log/trafficserver/diags.log
> [Sep 8 12:26:03.012] Server {0x2b5f43db54c0} WARNING: All storage devices
> offline, cache disabled
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)