[ https://issues.apache.org/jira/browse/TS-4870?focusedWorklogId=29325&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-29325 ]

ASF GitHub Bot logged work on TS-4870:
--------------------------------------

                Author: ASF GitHub Bot
            Created on: 19/Sep/16 18:51
            Start Date: 19/Sep/16 18:51
    Worklog Time Spent: 10m 
      Work Description: Github user gtenev commented on the issue:

    https://github.com/apache/trafficserver/pull/1028
  
    @jpeach, appreciate your feedback!
    
    I felt that "disk being offline" (which might be an operator's decision) and
"disk being bad" (the number of I/O errors has reached a threshold) are better
kept separate in general.
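
    For illustration, a minimal sketch of keeping the two states separate
(hypothetical field layout, not the actual `CacheDisk` definition):

    ```cpp
    struct CacheDisk {
      int num_errors = 0; // "disk is bad": I/O errors reached a threshold
      bool online = true; // "disk is offline": e.g. an operator's decision
    };
    ```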
     
    IMHO, using `CacheDisk::num_errors` to mark the disk offline could be
error-prone; here is an example.
    
    Let us say `proxy.config.cache.max_disk_errors=5` and a disk keeps
failing, causing `handle_disk_failure()` to be called repeatedly. At some
point `CacheDisk::num_errors` reaches `5`, which causes
`mark_storage_offline()` to be called.
    
    At this point, since `CacheDisk::num_errors` is `5`, `DISK_BAD(d)` is
already `true`.
    
    It seems that if I guarded the call with `if (!DISK_BAD(d)) {...}` (as
suggested above), the code in `mark_storage_offline()` would not execute at
all; for instance, `proxy.process.cache.bytes_total_stat` would not get
updated as it should.
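
    To make the sequence concrete, here is a minimal, self-contained sketch
(it assumes `DISK_BAD(d)` expands to `num_errors` reaching
`proxy.config.cache.max_disk_errors`; names follow the discussion above, not
the exact source):

    ```cpp
    #include <cstdio>

    static const int max_disk_errors = 5; // proxy.config.cache.max_disk_errors

    struct CacheDisk {
      int num_errors = 0;
    };

    // Assumed definition: the disk is "bad" once the error threshold is hit.
    static bool DISK_BAD(const CacheDisk *d) {
      return d->num_errors >= max_disk_errors;
    }

    static void mark_storage_offline(CacheDisk *) {
      std::puts("updating proxy.process.cache.bytes_total_stat ...");
    }

    static void handle_disk_failure(CacheDisk *d) {
      if (++d->num_errors >= max_disk_errors) {
        // DISK_BAD(d) is already true here, so the suggested guard
        // skips the offline bookkeeping entirely.
        if (!DISK_BAD(d)) {
          mark_storage_offline(d); // never reached
        }
      }
    }

    int main() {
      CacheDisk d;
      for (int i = 0; i < max_disk_errors; ++i) {
        handle_disk_failure(&d); // the 5th failure should offline the disk
      }
      // Prints nothing: the stats update never runs.
    }
    ```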
    
    This is one of my first adventures in the "cache" component, so I hope I
am not missing something. Please let me know what you think, and I will
gladly look/test/change as necessary.
    
    
    
    



Issue Time Tracking
-------------------

    Worklog Id:     (was: 29325)
    Time Spent: 1h 10m  (was: 1h)

> Storage can be marked offline multiple times which breaks related metrics
> -------------------------------------------------------------------------
>
>                 Key: TS-4870
>                 URL: https://issues.apache.org/jira/browse/TS-4870
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: Cache, Metrics
>            Reporter: Gancho Tenev
>            Assignee: Gancho Tenev
>             Fix For: 7.1.0
>
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Let us say Traffic Server is running with 2 disks:
> {code}
> $ cat etc/trafficserver/storage.config
> /dev/sdb
> /dev/sdc
> $ sudo fdisk -l|grep 'Disk /dev/sd[b|c]'
> Disk /dev/sdb: 134 MB, 134217728 bytes
> Disk /dev/sdc: 134 MB, 134217728 bytes
> {code}
> Let us see what happens when we mark the same disk ({{/dev/sdb}}) offline 3
> times in a row and check {{proxy.node.cache.bytes_total}}.
> {code}
> # Initial cache size (when using both disks).
> $ ./bin/traffic_ctl metric get proxy.node.cache.bytes_total
> proxy.node.cache.bytes_total 268025856
> # Take 1st disk offline. Cache size changes as expected.
> $ sudo ./bin/traffic_ctl storage offline /dev/sdb
> $ ./bin/traffic_ctl metric get proxy.node.cache.bytes_total
> proxy.node.cache.bytes_total 134012928
> # Take same disk offline again. Not good!
> $ sudo ./bin/traffic_ctl storage offline /dev/sdb
> $ ./bin/traffic_ctl metric get proxy.node.cache.bytes_total
> proxy.node.cache.bytes_total 0
> # Take same disk offline again. Negative value.
> $ sudo ./bin/traffic_ctl storage offline /dev/sdb
> $ ./bin/traffic_ctl metric get proxy.node.cache.bytes_total
> proxy.node.cache.bytes_total -134012928
> {code}
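> The metric goes negative because each offline call subtracts the disk size
> again. A minimal sketch of the idempotency guard this suggests (hypothetical
> {{online}} flag; not the actual patch):
> {code}
> #include <cstdint>
>
> struct CacheDisk {
>   bool online = true;     // hypothetical: has this disk been offlined yet?
>   int64_t size_bytes = 0;
> };
>
> // Stands in for proxy.node.cache.bytes_total.
> static int64_t cache_bytes_total = 0;
>
> void mark_storage_offline(CacheDisk *d)
> {
>   if (!d->online) {
>     return; // already offline: adjust the metric only once
>   }
>   d->online = false;
>   cache_bytes_total -= d->size_bytes;
> }
>
> int main()
> {
>   CacheDisk sdb{true, 134012928};
>   cache_bytes_total = 2 * 134012928;
>   mark_storage_offline(&sdb);
>   mark_storage_offline(&sdb); // no-op: the metric stays at 134012928
> }
> {code}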



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
