EdColeman commented on issue #946:
URL: https://github.com/apache/accumulo/issues/946#issuecomment-1061061376


   This was originally proposed as metrics / monitoring at a level such that 
operators and app developers could gain insight into overall health and trends.  
Having the threads throw exceptions is great, but this was directed more at 
monitoring and trending higher-level functions - things that could be using 
multiple threads.  @keith-turner provided some concrete examples. Knowing that 
the expected threads in the TabletGroupWatcher are running, and possibly timing 
how long each run takes, would allow metrics-based alerting and trending.
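
   As a rough illustration of the kind of signal meant here, a minimal sketch 
in Java follows. `WatcherHealth` and its methods are hypothetical names, not 
existing Accumulo code, and the staleness threshold is an assumption; a real 
version would presumably publish these values through the metrics system 
rather than hold them in a local object.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical helper (not part of Accumulo): tracks liveness and run time of a watcher loop. */
final class WatcherHealth {
    // -1 means "no pass has completed yet".
    private final AtomicLong lastCompletedMillis = new AtomicLong(-1);
    private final AtomicLong lastRunDurationMillis = new AtomicLong(-1);
    private final AtomicLong completedRuns = new AtomicLong(0);

    /**
     * Runs one pass of the watcher loop and records its duration and completion time.
     * If the pass throws, nothing is recorded, so the heartbeat goes stale.
     */
    void recordRun(Runnable onePass) {
        long start = System.nanoTime();
        onePass.run();
        lastRunDurationMillis.set(TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start));
        lastCompletedMillis.set(System.currentTimeMillis());
        completedRuns.incrementAndGet();
    }

    /** True if no pass has ever completed, or the last one finished more than maxAgeMillis ago. */
    boolean isStale(long nowMillis, long maxAgeMillis) {
        long last = lastCompletedMillis.get();
        return last < 0 || nowMillis - last > maxAgeMillis;
    }

    long getLastRunDurationMillis() {
        return lastRunDurationMillis.get();
    }

    long getCompletedRuns() {
        return completedRuns.get();
    }
}
```

   The watcher thread would wrap each pass in `recordRun(...)`, and a monitor 
could alert when `isStale(...)` turns true or trend the reported run duration 
over time.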
   
   This is speculation, and more a description of something desired than a 
concrete example that I know happens.  But assume that the thread handling user 
tablet assignments gets stuck or dies. If the manager keeps running, that will 
eventually be noticed through secondary effects - maybe it's FATEs on table 
creates that hang and back up or fail? Or it's splits that start failing...  
Exposing that function as a reportable metric could allow intervention sooner. 
Or it could be trended: if the thread starts taking longer and longer to run, 
one could look at what has changed and fix something before it falls over.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

