Re: [I] Emit a metric for tablets that have more files than the scan max and are not compacting [accumulo]

via GitHub Wed, 20 Dec 2023 14:39:56 -0800


keith-turner commented on issue #4090:
URL: https://github.com/apache/accumulo/issues/4090#issuecomment-1865237124


   > To provide metric(s) that are actionable you may need tablet info - and 
adding tablet info would drastically increase the cardinality of the metric - 
it would not be a good candidate for a tag.
   
   I don't think per-tablet information is needed in the metrics system.  
Currently we have compaction services that bin tablets into different queues for 
compactions.  We generate metrics on the number queued and running for each 
queue.  If a user sees that the number of queued tablets is too high for a 
compaction service, it's an indication that further investigation is needed, but 
the metrics system does not provide the information needed to answer the 
question of why too many things are queued.
   
   Thinking along those same lines, maybe each compaction service could have 
another metric for problems. Not sure if we need a metric for each kind of 
problem or just a general problem counter per compaction service.  But this 
falls into that general idea of binning tablets into different categories and 
counting those.  If the counts for any category seem off, the metrics system 
cannot help you figure out why the counts are off, but it does indicate that 
action is needed.
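
   To make the binning idea concrete, here is a minimal sketch of counting 
tablets per (compaction service, category) without carrying any tablet 
identity into the metric.  `TabletStatus`, `count_by_category`, and the 
category names are hypothetical and only for illustration; they are not 
Accumulo types or existing metric names.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class TabletStatus:
    """Hypothetical per-tablet snapshot; not an Accumulo type."""
    service: str   # compaction service the tablet is assigned to
    category: str  # e.g. "queued", "running", "planning_problem"


def count_by_category(statuses):
    """Return a Counter keyed by (service, category).

    Tablet identity is deliberately dropped, so the number of distinct
    metric series depends on services and categories, not on tablets.
    """
    return Counter((s.service, s.category) for s in statuses)


# Made-up input purely to exercise the function.
statuses = [
    TabletStatus("CS1", "queued"),
    TabletStatus("CS1", "queued"),
    TabletStatus("CS1", "planning_problem"),
    TabletStatus("CS2", "running"),
]
counts = count_by_category(statuses)
```

   The point of the sketch is that the cardinality stays bounded by the number 
of services times the number of categories, no matter how many tablets exist.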
   
   Below is an example of what I am thinking about.  There are two compaction 
services, each with two queues for running compactions.  For each of the queues 
there are three metrics: the number of compactions currently running, the 
number queued, and the number that recently ran and failed.  Also, each 
compaction service has a count of the number of tablets where it had a problem 
planning compactions. This planning problem counter would cover this issue.  I 
think these are reasonable counters.  When someone sees failed compactions they 
have to go look at compactor process logs and try to figure out what failed and 
why.  When someone sees planning failures they need to go look at the manager 
logs and see what happened.
   
   | Compaction service | Metric | Count |
   |-|-|-|
   | CS1 | small.running | 50 |
   | CS1 | small.queued | 500 |
   | CS1 | small.failed | 0 |
   | CS1 | large.running | 100 |
   | CS1 | large.queued | 10000 |
   | CS1 | large.failed | 10 |
   | CS1 | planning_problems | 20 |
   | CS2 | small.running | 30 |
   | CS2 | small.queued | 10 |
   | CS2 | small.failed | 0 |
   | CS2 | large.running | 500 |
   | CS2 | large.queued | 400 |
   | CS2 | large.failed | 0 |
   | CS2 | planning_problems | 0 |
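
   As a rough sketch of how someone might act on counters like the ones in the 
table above, the snippet below flags non-zero failure and planning counters and 
points at the logs mentioned earlier.  The `metrics` dictionary just copies the 
example values from the table, and `where_to_look` is a hypothetical helper, 
not part of Accumulo.

```python
# Example values copied from the table above, keyed by (service, metric).
metrics = {
    ("CS1", "small.running"): 50,
    ("CS1", "small.queued"): 500,
    ("CS1", "small.failed"): 0,
    ("CS1", "large.running"): 100,
    ("CS1", "large.queued"): 10000,
    ("CS1", "large.failed"): 10,
    ("CS1", "planning_problems"): 20,
    ("CS2", "small.running"): 30,
    ("CS2", "small.queued"): 10,
    ("CS2", "small.failed"): 0,
    ("CS2", "large.running"): 500,
    ("CS2", "large.queued"): 400,
    ("CS2", "large.failed"): 0,
    ("CS2", "planning_problems"): 0,
}


def where_to_look(metrics):
    """Return (service, metric, log source) for each non-zero problem counter."""
    hints = []
    for (service, metric), count in sorted(metrics.items()):
        if count == 0:
            continue
        if metric.endswith(".failed"):
            # Failed compactions: investigate on the compactor side.
            hints.append((service, metric, "compactor process logs"))
        elif metric == "planning_problems":
            # Planning failures: investigate on the manager side.
            hints.append((service, metric, "manager logs"))
    return hints


hints = where_to_look(metrics)
```

   With the example values, only CS1 produces hints; the counters say that 
action is needed and where to start, not why the problem happened.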
   


