GitHub user gtenev commented on the issue:
@jpeach and @SolidWallOfCode, really appreciate your feedback! Here is the
Removed `_count`. The reason I added it was that internally the "bad disk
metric" and "disk errors" both had counter semantics and there are already
metrics containing `_count`.
Renamed `disk` to `span`; this does seem more accurate.
I like the idea of exposing "online", "offline", etc. The new patch exposes
the following metrics as gauges:
A span "moves" from the "online" bucket (`errors==0`) to "failing" (`errors >
0 && errors < proxy.config.cache.max_disk_errors`) to "offline" (`errors >=
Please note that "failing" + "offline" + "online" = total number of spans.
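To make the bucket transitions concrete, here is a minimal sketch of the classification in Python. It is not the code from the patch: `classify_span` and `span_gauges` are hypothetical names, and the "offline" condition is assumed to cover every span that is neither "online" nor "failing", which is what the sum invariant above implies.

```python
def classify_span(errors: int, max_disk_errors: int) -> str:
    """Return the bucket a span falls into for a given error count.

    max_disk_errors stands in for proxy.config.cache.max_disk_errors.
    """
    if errors == 0:
        return "online"
    if errors < max_disk_errors:
        return "failing"
    # Assumption: everything else is "offline", so the three buckets
    # always add up to the total number of spans.
    return "offline"


def span_gauges(error_counts, max_disk_errors):
    """Compute the three gauge values from per-span error counts."""
    gauges = {"online": 0, "failing": 0, "offline": 0}
    for errors in error_counts:
        gauges[classify_span(errors, max_disk_errors)] += 1
    return gauges
```

Because each span lands in exactly one bucket, the gauge values sum to the number of spans regardless of the error counts.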
It was possible to split the read and write metrics, so I removed
`proxy.process.cache.disk.errors` in favor of
Removed the "per-volume" stats. @SolidWallOfCode, incrementing the metrics
for all volumes on each failure does not make sense either. I would need to add
more code to be able to increment the metrics for the right volume and I am not
sure it is worth the effort (and maintenance).
I think the idea of this change in general is to give the operator a signal
that the disks need to be inspected; the operator would then diagnose them
with more sophisticated lower-level disk utilities.
I noticed that the metrics would not be updated properly (they would become
inconsistent) if there were failures during cache initialization (before
"cache enabled"), and tried to fix this wherever I noticed it and was able to test /