[ https://issues.apache.org/jira/browse/HBASE-29263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Viraj Jasani updated HBASE-29263:
---------------------------------
Description:
As of today, the procedure metrics we have include:
* SubmittedCount: Counter
* Time: Histogram
* FailedCount: Counter
SubmittedCount is updated when a procedure is submitted for execution,
whereas the Time histogram and FailedCount metrics are updated when the
procedure terminates.
Recent incidents such as HBASE-29251 showed that we have no metrics that
indicate long-running or stuck procedures, so we cannot create alerts for
them.
The purpose of this Jira is to introduce metrics for long-running procedures.
One possible approach is a chore that periodically counts the procedures that
are currently executing and have been running longer than a configurable
duration.
was:
As of today, the procedure metrics we have include:
*
SubmittedCount: Counter
*
Time: Histogram
*
FailedCount: Counter
While the SubmittedCount is updated when the given procedure is submitted for
execution, the Time histogram and FailedCount metrics are updated upon the
termination of the procedures.
With recent incidents like HBASE-29251, we have realized that we don't have
metrics to indicate long running or stuck procedures on which we can create
alerts.
The purpose of this Jira is to introduce metrics for long running procedures.
One possible way to introduce such metric is by a chore that can periodically
look into how many procedures are currently being executed and have exceeded
certain amount of configurable time duration.
> Metrics for long running procedures
> -----------------------------------
>
> Key: HBASE-29263
> URL: https://issues.apache.org/jira/browse/HBASE-29263
> Project: HBase
> Issue Type: Improvement
> Affects Versions: 3.0.0-beta-1, 2.5.11, 2.6.2
> Reporter: Viraj Jasani
> Assignee: Prathyusha
> Priority: Major
>
> As of today, the procedure metrics we have include:
> * SubmittedCount: Counter
> * Time: Histogram
> * FailedCount: Counter
> SubmittedCount is updated when a procedure is submitted for execution,
> whereas the Time histogram and FailedCount metrics are updated when the
> procedure terminates.
> Recent incidents such as HBASE-29251 showed that we have no metrics that
> indicate long-running or stuck procedures, so we cannot create alerts for
> them.
> The purpose of this Jira is to introduce metrics for long-running procedures.
> One possible approach is a chore that periodically counts the procedures that
> are currently executing and have been running longer than a configurable
> duration.
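
The chore-based approach proposed in the description could be sketched roughly
as follows. This is only an illustration in plain Java: the class
LongRunningProcedureTracker, its methods, and the threshold parameter are all
hypothetical and are not part of HBase's procedure framework or metrics API.
It assumes something records when each procedure starts and finishes, and that
a periodic task calls refresh() to recompute a gauge.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical sketch, not HBase code: tracks in-flight procedures and
 * counts those that have been running longer than a configurable threshold.
 */
public class LongRunningProcedureTracker {
    // procId -> start time in milliseconds, for procedures still executing
    private final Map<Long, Long> startTimes = new ConcurrentHashMap<>();
    // Assumed to come from configuration; the name is illustrative only
    private final long thresholdMillis;
    // Gauge value that a metrics system could expose for alerting
    private final AtomicLong longRunningCount = new AtomicLong();

    public LongRunningProcedureTracker(long thresholdMillis) {
        this.thresholdMillis = thresholdMillis;
    }

    /** Record that a procedure started executing. */
    public void onStart(long procId, long nowMillis) {
        startTimes.put(procId, nowMillis);
    }

    /** Record that a procedure terminated (successfully or not). */
    public void onFinish(long procId) {
        startTimes.remove(procId);
    }

    /** Body of the periodic chore: recompute and publish the gauge. */
    public long refresh(long nowMillis) {
        long count = startTimes.values().stream()
            .filter(start -> nowMillis - start > thresholdMillis)
            .count();
        longRunningCount.set(count);
        return count;
    }

    /** Last value computed by refresh(). */
    public long getLongRunningCount() {
        return longRunningCount.get();
    }
}
```

Unlike the existing Counter and Histogram metrics, this value is a gauge: it
can go down again once a slow procedure finishes, which is what makes it
usable as an alert condition. How refresh() is scheduled is left open here;
any periodic executor would do for the purposes of this sketch.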