[ 
https://issues.apache.org/jira/browse/HBASE-29263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani updated HBASE-29263:
---------------------------------
    Description: 
As of today, the procedure metrics we have include:
 * SubmittedCount: Counter
 * Time: Histogram
 * FailedCount: Counter

While the SubmittedCount is updated when the given procedure is submitted for 
execution, the Time histogram and FailedCount metrics are updated upon the 
termination of the procedures.

With recent incidents like HBASE-29251, we have realized that we don't have 
metrics to indicate long running or stuck procedures on which we can create 
alerts.

The purpose of this Jira is to introduce metrics for long running procedures.
One possible way to introduce such a metric is a chore that periodically counts
the procedures that are currently executing and have been running for longer
than a configurable duration.
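
A minimal sketch of the counting step such a chore could perform, in plain Java. Everything here is an assumption made for illustration and not existing HBase code: the class LongRunningProcedureGauge, the RunningProcedure snapshot type, and the choice of measuring age from the submission time. In HBase the refresh() call would presumably be driven by a periodic chore and the count exposed as a gauge.

```java
import java.time.Duration;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

public class LongRunningProcedureGauge {

  /** Hypothetical snapshot of a running procedure; not an HBase class. */
  public static final class RunningProcedure {
    final long procId;
    final long submittedTimeMs;

    public RunningProcedure(long procId, long submittedTimeMs) {
      this.procId = procId;
      this.submittedTimeMs = submittedTimeMs;
    }
  }

  // Latest published value; read by the metrics system, written by the chore.
  private final AtomicLong longRunningCount = new AtomicLong();
  private final long thresholdMs;

  public LongRunningProcedureGauge(Duration threshold) {
    this.thresholdMs = threshold.toMillis();
  }

  /** Counts procedures whose age at nowMs is strictly greater than thresholdMs. */
  public static long countLongRunning(List<RunningProcedure> running, long nowMs,
      long thresholdMs) {
    return running.stream()
      .filter(p -> nowMs - p.submittedTimeMs > thresholdMs)
      .count();
  }

  /** Body of the periodic chore: recompute the count and publish it. */
  public void refresh(List<RunningProcedure> running, long nowMs) {
    longRunningCount.set(countLongRunning(running, nowMs, thresholdMs));
  }

  public long getLongRunningCount() {
    return longRunningCount.get();
  }
}
```

Because the value is recomputed on every run instead of incremented, it drops back down once a stuck procedure finishes, which is the behavior an alert on "procedures running longer than N minutes" would need.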


> Metrics for long running procedures
> -----------------------------------
>
>                 Key: HBASE-29263
>                 URL: https://issues.apache.org/jira/browse/HBASE-29263
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 3.0.0-beta-1, 2.5.11, 2.6.2
>            Reporter: Viraj Jasani
>            Assignee: Prathyusha
>            Priority: Major
>
> As of today, the procedure metrics we have include:
>  * SubmittedCount: Counter
>  * Time: Histogram
>  * FailedCount: Counter
> While the SubmittedCount is updated when the given procedure is submitted for 
> execution, the Time histogram and FailedCount metrics are updated upon the 
> termination of the procedures.
> With recent incidents like HBASE-29251, we have realized that we don't have 
> metrics to indicate long running or stuck procedures on which we can create 
> alerts.
> The purpose of this Jira is to introduce metrics for long running procedures.
> One possible way to introduce such a metric is a chore that periodically counts
> the procedures that are currently executing and have been running for longer
> than a configurable duration.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
