[ https://issues.apache.org/jira/browse/TS-4834?focusedWorklogId=28792&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-28792 ]
ASF GitHub Bot logged work on TS-4834:
--------------------------------------

                Author: ASF GitHub Bot
            Created on: 12/Sep/16 16:37
            Start Date: 12/Sep/16 16:37
    Worklog Time Spent: 10m
      Work Description: Github user jpeach commented on the issue:

    https://github.com/apache/trafficserver/pull/996

    @gtenev See my comments here and in the bug. I think that these new
    metrics should follow the existing conventions of the cache system.


Issue Time Tracking
-------------------

    Worklog Id: (was: 28792)
    Time Spent: 1h (was: 50m)

> Expose bad disk and disk access failures
> ----------------------------------------
>
>                 Key: TS-4834
>                 URL: https://issues.apache.org/jira/browse/TS-4834
>             Project: Traffic Server
>          Issue Type: Improvement
>          Components: Cache, Metrics
>            Reporter: Gancho Tenev
>            Assignee: Gancho Tenev
>             Fix For: 7.0.0
>
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> We would like to monitor low-level disk access failures and disks marked by ATS as bad.
> I have a patch that exposes that information through
> {code}
> proxy.process.cache.disk_error_count 10
> proxy.process.cache.disk_bad_count 5
> {code}
> and the following test shows how it would work.
> Start ATS with 2 disks and tail {{diags.log}}:
> {code}
> $ cat etc/trafficserver/storage.config
> /dev/sdb
> /dev/sdc
> $ tail -f var/log/trafficserver/diags.log
> [Sep 8 12:18:48.149] Server {0x2b5f43db54c0} NOTE: traffic server running
> [Sep 8 12:18:48.198] Server {0x2b5f44654700} NOTE: cache enabled
> {code}
> Check the related metrics and observe that they are all 0:
> {code}
> $ ./bin/traffic_ctl metric match "proxy.process.cache*.disk.*" "proxy.process.cache.*(read|write).failure" "proxy.process.http.cache_(read|write)_errors"
> proxy.process.cache.disk_error_count 0
> proxy.process.cache.disk_bad_count 0
> proxy.process.cache.read.failure 0
> proxy.process.cache.write.failure 0
> proxy.process.cache.volume_0.read.failure 0
> proxy.process.cache.volume_0.write.failure 0
> proxy.process.http.cache_write_errors 0
> proxy.process.http.cache_read_errors 0
> {code}
> Now, using your favorite hard-disk failure-injection tool (one possible approach is sketched just below), set both disks used by this setup, {{/dev/sdb}} and {{/dev/sdc}}, to fail all reads, then send 5 requests, causing 10 failed disk reads.
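> (For illustration only; the exact tool is not important and this is not necessarily what was used here. The Linux device-mapper {{error}} target is one way to make a block device fail every I/O; the device and mapping names below are just examples.)
> {code}
> # size of the backing device, in 512-byte sectors
> SECTORS=$(blockdev --getsz /dev/sdb)
> # create a device-mapper device that returns an I/O error for every read and write
> dmsetup create sdb_err --table "0 ${SECTORS} error"
> # then point storage.config at /dev/mapper/sdb_err instead of /dev/sdb and restart ATS
> {code}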
> {code}
> $ for i in 1 2 3 4 5; do curl -x 127.0.0.1:80 http://example.com/1 -o /dev/null -s; done
> $ tail -f var/log/trafficserver/diags.log
> [Sep 8 12:19:09.758] Server {0x2aaab4302700} WARNING: cache disk operation failed READ -1 0
> [Sep 8 12:19:09.759] Server {0x2aaac0100700} WARNING: cache disk operation failed READ -1 0
> [Sep 8 12:19:09.764] Server {0x2b5f43db54c0} WARNING: Error accessing Disk /dev/sdb [1/10]
> [Sep 8 12:19:09.769] Server {0x2b5f44654700} WARNING: Error accessing Disk /dev/sdb [2/10]
> [Sep 8 12:19:09.785] Server {0x2aaac0100700} WARNING: cache disk operation failed READ -1 0
> [Sep 8 12:19:09.786] Server {0x2aaab4302700} WARNING: cache disk operation failed READ -1 0
> [Sep 8 12:19:09.791] Server {0x2b5f44654700} WARNING: Error accessing Disk /dev/sdb [3/10]
> [Sep 8 12:19:09.796] Server {0x2b5f43db54c0} WARNING: Error accessing Disk /dev/sdb [4/10]
> [Sep 8 12:19:09.812] Server {0x2aaab4100700} WARNING: cache disk operation failed READ -1 0
> [Sep 8 12:19:09.813] Server {0x2aaacc100700} WARNING: cache disk operation failed READ -1 0
> [Sep 8 12:19:09.817] Server {0x2b5f43db54c0} WARNING: Error accessing Disk /dev/sdb [5/10]
> [Sep 8 12:19:09.823] Server {0x2b5f44654700} WARNING: Error accessing Disk /dev/sdb [6/10]
> [Sep 8 12:19:09.843] Server {0x2aaacc302700} WARNING: cache disk operation failed READ -1 0
> [Sep 8 12:19:09.844] Server {0x2aaad8100700} WARNING: cache disk operation failed READ -1 0
> [Sep 8 12:19:09.847] Server {0x2b5f44654700} WARNING: Error accessing Disk /dev/sdb [7/10]
> [Sep 8 12:19:09.854] Server {0x2b5f43db54c0} WARNING: Error accessing Disk /dev/sdb [8/10]
> [Sep 8 12:19:09.874] Server {0x2aaacc302700} WARNING: cache disk operation failed READ -1 0
> [Sep 8 12:19:09.875] Server {0x2aaad8100700} WARNING: cache disk operation failed READ -1 0
> [Sep 8 12:19:09.880] Server {0x2b5f43db54c0} WARNING: Error accessing Disk /dev/sdb [9/10]
> [Sep 8 12:19:09.887] Server {0x2b5f44654700} WARNING: too many errors accessing disk /dev/sdb [10/10]: declaring disk bad
> {code}
> We see 5 read failures, which triggered 10 actual failed disk reads and marked the failing disk as bad.
> {code}
> $ ./bin/traffic_ctl metric match "proxy.process.cache*.disk.*" "proxy.process.cache.*(read|write).failure" "proxy.process.http.cache_(read|write)_errors"
> proxy.process.cache.disk_error_count 10
> proxy.process.cache.disk_bad_count 1
> proxy.process.cache.read.failure 5
> proxy.process.cache.write.failure 5
> proxy.process.cache.volume_0.read.failure 5
> proxy.process.cache.volume_0.write.failure 5
> proxy.process.http.cache_write_errors 0
> proxy.process.http.cache_read_errors 0
> {code}
> Now send 5 more requests, again causing 10 failed reads.
> {code}
> $ for i in 1 2 3 4 5; do curl -x 127.0.0.1:80 http://example.com/1 -o /dev/null -s; done
> $ tail -f var/log/trafficserver/diags.log
> [Sep 8 12:26:02.874] Server {0x2aaae4100700} WARNING: cache disk operation failed READ -1 0
> [Sep 8 12:26:02.875] Server {0x2aaaf0302700} WARNING: cache disk operation failed READ -1 0
> [Sep 8 12:26:02.876] Server {0x2b5f44654700} WARNING: Error accessing Disk /dev/sdc [1/10]
> [Sep 8 12:26:02.885] Server {0x2b5f43db54c0} WARNING: Error accessing Disk /dev/sdc [2/10]
> [Sep 8 12:26:02.902] Server {0x2aaaf0302700} WARNING: cache disk operation failed READ -1 0
> [Sep 8 12:26:02.902] Server {0x2aaae4100700} WARNING: cache disk operation failed READ -1 0
> [Sep 8 12:26:02.907] Server {0x2b5f43db54c0} WARNING: Error accessing Disk /dev/sdc [3/10]
> [Sep 8 12:26:02.914] Server {0x2b5f44654700} WARNING: Error accessing Disk /dev/sdc [4/10]
> [Sep 8 12:26:02.936] Server {0x2aaafc100700} WARNING: cache disk operation failed READ -1 0
> [Sep 8 12:26:02.936] Server {0x2aab08100700} WARNING: cache disk operation failed READ -1 0
> [Sep 8 12:26:02.940] Server {0x2b5f44654700} WARNING: Error accessing Disk /dev/sdc [5/10]
> [Sep 8 12:26:02.946] Server {0x2b5f43db54c0} WARNING: Error accessing Disk /dev/sdc [6/10]
> [Sep 8 12:26:02.971] Server {0x2aab08100700} WARNING: cache disk operation failed READ -1 0
> [Sep 8 12:26:02.971] Server {0x2aaafc100700} WARNING: cache disk operation failed READ -1 0
> [Sep 8 12:26:02.976] Server {0x2b5f43db54c0} WARNING: Error accessing Disk /dev/sdc [7/10]
> [Sep 8 12:26:02.982] Server {0x2b5f44654700} WARNING: Error accessing Disk /dev/sdc [8/10]
> [Sep 8 12:26:03.002] Server {0x2aaafc100700} WARNING: cache disk operation failed READ -1 0
> [Sep 8 12:26:03.002] Server {0x2aab08100700} WARNING: cache disk operation failed READ -1 0
> [Sep 8 12:26:03.006] Server {0x2b5f44654700} WARNING: Error accessing Disk /dev/sdc [9/10]
> [Sep 8 12:26:03.012] Server {0x2b5f43db54c0} WARNING: too many errors accessing disk /dev/sdc [10/10]: declaring disk bad
> {code}
> We see 5 more read failures, which triggered 10 more actual failed disk reads and marked the second disk as bad.
> {code}
> $ ./bin/traffic_ctl metric match "proxy.process.cache*.disk.*" "proxy.process.cache.*(read|write).failure" "proxy.process.http.cache_(read|write)_errors"
> proxy.process.cache.disk_error_count 20
> proxy.process.cache.disk_bad_count 2
> proxy.process.cache.read.failure 10
> proxy.process.cache.write.failure 10
> proxy.process.cache.volume_0.read.failure 10
> proxy.process.cache.volume_0.write.failure 10
> proxy.process.http.cache_write_errors 0
> proxy.process.http.cache_read_errors 0
> {code}
> After both disks were marked bad, ATS disabled the cache and started going to the origin on every request.
> {code}
> $ tail -f var/log/trafficserver/diags.log
> [Sep 8 12:26:03.012] Server {0x2b5f43db54c0} WARNING: All storage devices offline, cache disabled
> {code}
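> Since the point of the new counters is monitoring, a minimal check from a monitoring script could look like the sketch below. It is illustrative only: it reuses the {{traffic_ctl metric match}} command and metric names shown above, and the alerting hook is a placeholder.
> {code}
> #!/bin/sh
> # Alert if ATS has marked any cache disk as bad.
> BAD=$(./bin/traffic_ctl metric match "proxy.process.cache.disk_bad_count" | awk '{print $2}')
> if [ "${BAD:-0}" -gt 0 ]; then
>     echo "ATS reports ${BAD} bad cache disk(s)"   # replace with your alerting mechanism
> fi
> {code}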