keith-turner commented on issue #4090: URL: https://github.com/apache/accumulo/issues/4090#issuecomment-1865237124
> To provide metric(s) that are actionable you may need tablet info - and adding tablet info would drastically increase the cardinality of the metric - it would not be a good candidate for a tag.

I don't think per-tablet information is needed in the metrics system. Currently we have compaction services that bin tablets into different queues for compactions. We generate metrics on the number queued and running for each queue. If a user sees that the number of queued tablets is too high for a compaction service, it's an indication that further investigation is needed, but the metrics system does not provide the information needed to answer the question of why too many things are queued.

Thinking along those same lines, maybe each compaction service could have another metric for problems. Not sure if we need a metric for each kind of problem or just a general problem counter per compaction service. But this falls into that general idea of binning tablets into different categories and counting those. If the counts for any category seem off, the metrics system cannot help you figure out why the counts are off, but it does indicate that action is needed.

Below is an example of what I am thinking about. There are two compaction services, each with two queues for running compactions. For each of the queues there are three metrics: the number of compactions currently running, the number queued, and the number that recently ran and failed. Also, each compaction service has a count of the number of tablets where it had a problem planning compactions. This planning problem counter would cover this issue. I think these are reasonable counters. When someone sees failed compactions, they have to go look at the compactor process logs and try to figure out what failed and why. When someone sees planning failures, they need to go look at the manager logs and see what happened.

| Compaction service | Metric | Count |
|-|-|-|
| CS1 | small.running | 50 |
| CS1 | small.queued | 500 |
| CS1 | small.failed | 0 |
| CS1 | large.running | 100 |
| CS1 | large.queued | 10000 |
| CS1 | large.failed | 10 |
| CS1 | planning_problems | 20 |
| CS2 | small.running | 30 |
| CS2 | small.queued | 10 |
| CS2 | small.failed | 0 |
| CS2 | large.running | 500 |
| CS2 | large.queued | 400 |
| CS2 | large.failed | 0 |
| CS2 | planning_problems | 0 |
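
To make the binning idea concrete, here is a rough sketch in Python of counting tablets per category without putting any tablet identifier in the metric key. It is not Accumulo code: `TabletEvent`, `metric_name`, `bin_counts`, and `over_limit` are hypothetical names made up for illustration, and the state strings are assumptions modeled on the table above.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Tuple

MetricKey = Tuple[str, str]  # (compaction service, metric name)


@dataclass(frozen=True)
class TabletEvent:
    """Hypothetical record of one tablet's compaction status."""
    service: str          # compaction service, e.g. "CS1"
    queue: Optional[str]  # queue name, or None for a planning problem
    state: str            # "running", "queued", "failed", or "planning_problem"


def metric_name(event: TabletEvent) -> str:
    """Map an event to a metric name such as 'small.queued'."""
    if event.state == "planning_problem":
        return "planning_problems"
    return f"{event.queue}.{event.state}"


def bin_counts(events: Iterable[TabletEvent]) -> Counter:
    """Count events per (service, metric) bin.

    The key never contains a tablet id, so the number of distinct keys is
    bounded by the number of services and queues, not by the tablet count.
    """
    counts: Counter = Counter()
    for event in events:
        counts[(event.service, metric_name(event))] += 1
    return counts


def over_limit(counts: Counter, limits: Dict[MetricKey, int]) -> List[MetricKey]:
    """Return the bins whose count exceeds the configured limit."""
    return sorted(key for key, limit in limits.items() if counts[key] > limit)
```

A nonzero `planning_problems` count, or a bin returned by `over_limit`, only says that action is needed; finding out why still means reading the manager or compactor logs.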
