[ 
https://issues.apache.org/jira/browse/TS-4834?focusedWorklogId=31305&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-31305
 ]

ASF GitHub Bot logged work on TS-4834:
--------------------------------------

                Author: ASF GitHub Bot
            Created on: 28/Oct/16 19:54
            Start Date: 28/Oct/16 19:54
    Worklog Time Spent: 10m 
      Work Description: Github user SolidWallOfCode commented on a diff in the 
pull request:

    https://github.com/apache/trafficserver/pull/996#discussion_r85598778
  
    --- Diff: iocore/cache/CacheDisk.cc ---
    @@ -23,6 +23,35 @@
     
     #include "P_Cache.h"
     
    +void
    +CacheDisk::incrErrors(const AIOCallback *io)
    +{
    +  if (0 == this->num_errors) {
    +    /* This is the first read/write error on this span since ATS started.
    +     * Move the newly failing span from "online" to "failing" bucket. */
    +    RecIncrGlobalRawStat(cache_rsb, (int)(cache_span_online_stat), -1);
    +    RecIncrGlobalRawStat(cache_rsb, (int)(cache_span_failing_stat), 1);
    +  }
    +  this->num_errors++;
    +
    +  int opcode   = io->aiocb.aio_lio_opcode;
    +  char *opname = (char *)"unknown";
    --- End diff ---
    
    Why not declare this as `const char*` instead of casting?
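
    For illustration, the reviewer's suggestion would look like this (a
    sketch of the suggested declaration, not the committed change):

        const char *opname = "unknown"; // a string literal converts to const char * with no cast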


Issue Time Tracking
-------------------

    Worklog Id:     (was: 31305)
    Time Spent: 2h 10m  (was: 2h)

> Expose bad disk and disk access failures
> ----------------------------------------
>
>                 Key: TS-4834
>                 URL: https://issues.apache.org/jira/browse/TS-4834
>             Project: Traffic Server
>          Issue Type: Improvement
>          Components: Cache, Metrics
>            Reporter: Gancho Tenev
>            Assignee: Gancho Tenev
>             Fix For: 7.1.0
>
>          Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> We would like to monitor low-level disk access failures and disks marked by 
> ATS as bad.
> I have a patch that exposes that information through
> {code}
> proxy.process.cache.disk_error_count 10
> proxy.process.cache.disk_bad_count 5
> {code}
> and the following walkthrough shows how it would work.
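> Under the hood, incrementing these counters could follow the same 
> {{RecIncrGlobalRawStat}} pattern the patch already uses for the span stats; 
> a minimal sketch (the {{cache_disk_errors_stat}} and {{cache_disk_bad_stat}} 
> enum names here are assumptions, not taken from the patch):
> {code}
> // Sketch: bump the error counter on every failed disk I/O, and the
> // bad-disk counter once when a disk crosses its error threshold.
> RecIncrGlobalRawStat(cache_rsb, (int)cache_disk_errors_stat, 1);
> if (disk_declared_bad) {
>   RecIncrGlobalRawStat(cache_rsb, (int)cache_disk_bad_stat, 1);
> }
> {code}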
> Start ATS with 2 disks and tail {{diags.log}}
> {code}
> $ cat etc/trafficserver/storage.config
> /dev/sdb
> /dev/sdc
> $ tail -f var/log/trafficserver/diags.log
> [Sep  8 12:18:48.149] Server {0x2b5f43db54c0} NOTE: traffic server running
> [Sep  8 12:18:48.198] Server {0x2b5f44654700} NOTE: cache enabled
> {code}
> Check related metrics and observe all 0s
> {code}
> $ ./bin/traffic_ctl metric match "proxy.process.cache*.disk.*" 
> "proxy.process.cache.*(read|write).failure" 
> "proxy.process.http.cache_(read|write)_errors"
> proxy.process.cache.disk_error_count 0
> proxy.process.cache.disk_bad_count 0
> proxy.process.cache.read.failure 0
> proxy.process.cache.write.failure 0
> proxy.process.cache.volume_0.read.failure 0
> proxy.process.cache.volume_0.write.failure 0
> proxy.process.http.cache_write_errors 0
> proxy.process.http.cache_read_errors 0
> {code}
> Now, using your favorite hard-disk failure injection tool, set both disks 
> used by this setup ({{/dev/sdb}} and {{/dev/sdc}}) to fail all reads, then 
> send 5 requests, causing 10 failed disk reads.
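> One concrete way to make every read fail (an illustration; the reporter's 
> tool is not specified) is the Linux device-mapper {{error}} target:
> {code}
> # Wrap /dev/sdb in a dm "error" target so all I/O to it fails;
> # repeat for /dev/sdc and point storage.config at the mapped devices.
> $ sudo dmsetup create failsdb --table "0 $(sudo blockdev --getsz /dev/sdb) error"
> {code}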
> {code}
> $ for i in 1 2 3 4 5; do curl -x 127.0.0.1:80 http://example.com/1 -o 
> /dev/null -s; done
> $ tail -f var/log/trafficserver/diags.log
> [Sep  8 12:19:09.758] Server {0x2aaab4302700} WARNING: cache disk operation 
> failed READ -1 0
> [Sep  8 12:19:09.759] Server {0x2aaac0100700} WARNING: cache disk operation 
> failed READ -1 0
> [Sep  8 12:19:09.764] Server {0x2b5f43db54c0} WARNING: Error accessing Disk 
> /dev/sdb [1/10]
> [Sep  8 12:19:09.769] Server {0x2b5f44654700} WARNING: Error accessing Disk 
> /dev/sdb [2/10]
> [Sep  8 12:19:09.785] Server {0x2aaac0100700} WARNING: cache disk operation 
> failed READ -1 0
> [Sep  8 12:19:09.786] Server {0x2aaab4302700} WARNING: cache disk operation 
> failed READ -1 0
> [Sep  8 12:19:09.791] Server {0x2b5f44654700} WARNING: Error accessing Disk 
> /dev/sdb [3/10]
> [Sep  8 12:19:09.796] Server {0x2b5f43db54c0} WARNING: Error accessing Disk 
> /dev/sdb [4/10]
> [Sep  8 12:19:09.812] Server {0x2aaab4100700} WARNING: cache disk operation 
> failed READ -1 0
> [Sep  8 12:19:09.813] Server {0x2aaacc100700} WARNING: cache disk operation 
> failed READ -1 0
> [Sep  8 12:19:09.817] Server {0x2b5f43db54c0} WARNING: Error accessing Disk 
> /dev/sdb [5/10]
> [Sep  8 12:19:09.823] Server {0x2b5f44654700} WARNING: Error accessing Disk 
> /dev/sdb [6/10]
> [Sep  8 12:19:09.843] Server {0x2aaacc302700} WARNING: cache disk operation 
> failed READ -1 0
> [Sep  8 12:19:09.844] Server {0x2aaad8100700} WARNING: cache disk operation 
> failed READ -1 0
> [Sep  8 12:19:09.847] Server {0x2b5f44654700} WARNING: Error accessing Disk 
> /dev/sdb [7/10]
> [Sep  8 12:19:09.854] Server {0x2b5f43db54c0} WARNING: Error accessing Disk 
> /dev/sdb [8/10]
> [Sep  8 12:19:09.874] Server {0x2aaacc302700} WARNING: cache disk operation 
> failed READ -1 0
> [Sep  8 12:19:09.875] Server {0x2aaad8100700} WARNING: cache disk operation 
> failed READ -1 0
> [Sep  8 12:19:09.880] Server {0x2b5f43db54c0} WARNING: Error accessing Disk 
> /dev/sdb [9/10]
> [Sep  8 12:19:09.887] Server {0x2b5f44654700} WARNING: too many errors 
> accessing disk /dev/sdb [10/10]: declaring disk bad
> {code}
> We see 5 read failures, which triggered 10 actual failed disk reads and 
> caused the failing disk to be marked bad.
> {code}
> $ ./bin/traffic_ctl metric match "proxy.process.cache*.disk.*" 
> "proxy.process.cache.*(read|write).failure" 
> "proxy.process.http.cache_(read|write)_errors"
> proxy.process.cache.disk_error_count 10
> proxy.process.cache.disk_bad_count 1
> proxy.process.cache.read.failure 5
> proxy.process.cache.write.failure 5
> proxy.process.cache.volume_0.read.failure 5
> proxy.process.cache.volume_0.write.failure 5
> proxy.process.http.cache_write_errors 0
> proxy.process.http.cache_read_errors 0
> {code}
> Now send 5 more requests, causing 10 more failed reads, this time on 
> {{/dev/sdc}}.
> {code}
> $ for i in 1 2 3 4 5; do curl -x 127.0.0.1:80 http://example.com/1 -o 
> /dev/null -s; done
> $ tail -f var/log/trafficserver/diags.log
> [Sep  8 12:26:02.874] Server {0x2aaae4100700} WARNING: cache disk operation 
> failed READ -1 0
> [Sep  8 12:26:02.875] Server {0x2aaaf0302700} WARNING: cache disk operation 
> failed READ -1 0
> [Sep  8 12:26:02.876] Server {0x2b5f44654700} WARNING: Error accessing Disk 
> /dev/sdc [1/10]
> [Sep  8 12:26:02.885] Server {0x2b5f43db54c0} WARNING: Error accessing Disk 
> /dev/sdc [2/10]
> [Sep  8 12:26:02.902] Server {0x2aaaf0302700} WARNING: cache disk operation 
> failed READ -1 0
> [Sep  8 12:26:02.902] Server {0x2aaae4100700} WARNING: cache disk operation 
> failed READ -1 0
> [Sep  8 12:26:02.907] Server {0x2b5f43db54c0} WARNING: Error accessing Disk 
> /dev/sdc [3/10]
> [Sep  8 12:26:02.914] Server {0x2b5f44654700} WARNING: Error accessing Disk 
> /dev/sdc [4/10]
> [Sep  8 12:26:02.936] Server {0x2aaafc100700} WARNING: cache disk operation 
> failed READ -1 0
> [Sep  8 12:26:02.936] Server {0x2aab08100700} WARNING: cache disk operation 
> failed READ -1 0
> [Sep  8 12:26:02.940] Server {0x2b5f44654700} WARNING: Error accessing Disk 
> /dev/sdc [5/10]
> [Sep  8 12:26:02.946] Server {0x2b5f43db54c0} WARNING: Error accessing Disk 
> /dev/sdc [6/10]
> [Sep  8 12:26:02.971] Server {0x2aab08100700} WARNING: cache disk operation 
> failed READ -1 0
> [Sep  8 12:26:02.971] Server {0x2aaafc100700} WARNING: cache disk operation 
> failed READ -1 0
> [Sep  8 12:26:02.976] Server {0x2b5f43db54c0} WARNING: Error accessing Disk 
> /dev/sdc [7/10]
> [Sep  8 12:26:02.982] Server {0x2b5f44654700} WARNING: Error accessing Disk 
> /dev/sdc [8/10]
> [Sep  8 12:26:03.002] Server {0x2aaafc100700} WARNING: cache disk operation 
> failed READ -1 0
> [Sep  8 12:26:03.002] Server {0x2aab08100700} WARNING: cache disk operation 
> failed READ -1 0
> [Sep  8 12:26:03.006] Server {0x2b5f44654700} WARNING: Error accessing Disk 
> /dev/sdc [9/10]
> [Sep  8 12:26:03.012] Server {0x2b5f43db54c0} WARNING: too many errors 
> accessing disk /dev/sdc [10/10]: declaring disk bad
> {code}
> We see 5 more read failures, which triggered 10 more actual failed disk 
> reads and marked the second disk bad as well.
> {code}
> $ ./bin/traffic_ctl metric match "proxy.process.cache*.disk.*" 
> "proxy.process.cache.*(read|write).failure" 
> "proxy.process.http.cache_(read|write)_errors"
> proxy.process.cache.disk_error_count 20
> proxy.process.cache.disk_bad_count 2
> proxy.process.cache.read.failure 10
> proxy.process.cache.write.failure 10
> proxy.process.cache.volume_0.read.failure 10
> proxy.process.cache.volume_0.write.failure 10
> proxy.process.http.cache_write_errors 0
> proxy.process.http.cache_read_errors 0
> {code}
> After both disks were marked bad, ATS finally disabled the cache and 
> started going to the origin on every request.
> {code}
> $ tail -f var/log/trafficserver/diags.log
> [Sep  8 12:26:03.012] Server {0x2b5f43db54c0} WARNING: All storage devices 
> offline, cache disabled
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
