Github user gtenev commented on the issue:
    @jpeach and @SolidWallOfCode, really appreciate your feedback! Here is the 
new patch.
    Removed `_count`. The reason I added it was that internally the "bad disk 
metric" and "disk errors" both had counter semantics and there are already 
metrics containing `_count`.
    Renamed `disk` to `span`; this does seem more accurate.
    I like the idea of exposing "online", "offline", etc. The new patch exposes 
the following metrics as gauges: 
    A span "moves" from the "online" bucket (`errors == 0`) to "failing" 
(`errors > 0 && errors < proxy.config.cache.max_disk_errors`) to "offline" 
(`errors >= 
    Please note that "failing" + "offline" + "online" = total number of spans.
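
    As a rough illustration of the bucket logic described above (not the 
actual patch code), here is a minimal Python sketch. The function names 
`classify_span` and `span_gauges` are hypothetical, and the "offline" 
threshold is assumed to be `proxy.config.cache.max_disk_errors`, inferred from 
the "failing" condition and from the invariant that the three buckets sum to 
the total number of spans.

```python
from collections import Counter


def classify_span(errors: int, max_disk_errors: int) -> str:
    """Return the bucket for one span given its error count.

    Assumption: a span goes "offline" once errors >= max_disk_errors,
    which is the complement of the "online" and "failing" conditions.
    """
    if errors == 0:
        return "online"
    if errors < max_disk_errors:
        return "failing"
    return "offline"


def span_gauges(error_counts, max_disk_errors):
    """Count how many spans currently sit in each bucket (gauge values)."""
    counts = Counter(classify_span(e, max_disk_errors) for e in error_counts)
    # Always report all three buckets, even when a bucket is empty.
    return {state: counts[state] for state in ("online", "failing", "offline")}


# Five hypothetical spans with per-span error counts, threshold of 5.
gauges = span_gauges([0, 0, 2, 5, 7], max_disk_errors=5)

# Every span lands in exactly one bucket, so the gauges sum to the total.
assert sum(gauges.values()) == 5
```

    Because each span falls into exactly one bucket, the three gauges always 
add up to the number of spans, which is the invariant noted above.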
    It was possible to split the read and write metrics, so I removed 
`proxy.process.cache.disk.errors` in favor of
    Removed the "per-volume" stats. @SolidWallOfCode, incrementing the metrics 
for all volumes on each failure does not make sense either. I would need to add 
more code to be able to increment the metrics for the right volume and I am not 
sure it is worth the effort (and maintenance).
    I think the general idea of this change is to give the operator a signal 
that the disks need to be inspected; the operator would then diagnose them 
with more sophisticated, lower-level disk utilities.
    Noticed that the metrics would not get updated properly (they would become 
inconsistent) if there were failures during cache initialization (before 
"cache enabled"), and tried to fix this wherever I noticed it and was able to test / 
