[jira] [Updated] (HDDS-11680) Ozone Recon - Enhance Prometheus Metrics For Improved Observability

Devesh Kumar Singh (Jira) Tue, 03 Dec 2024 02:40:48 -0800


     [ 
https://issues.apache.org/jira/browse/HDDS-11680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Devesh Kumar Singh updated HDDS-11680:
--------------------------------------
    Description: 
For SCM metadata background tasks, we can leverage the ICR (incremental 
container report) and FCR (full container report) metrics exposed by 
org.apache.hadoop.hdds.server.events.FixedThreadPoolWithAffinityExecutor, 
which is used for both ICR and FCR.

!image-2024-11-25-14-19-58-883.png|width=808,height=146!

So if any ICR/FCR events are still in the queue, we know that container reports 
are yet to be processed and that ContainerHealthTask and PipelineSyncTask may 
not be showing up-to-date data.

Similar queue metrics are available for pipeline reports as well.

!image-2024-11-25-14-26-32-614.png|width=724,height=158!

Add new Prometheus metrics for improved observability in Recon:

For both OM and SCM metadata background tasks:
 * lastRunStatus
 ** 0 - the last run of the task succeeded
 ** -1 - the last run of the task failed
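
The 0 / -1 encoding above can be sketched as a small per-task status holder. This is a minimal illustration only, not Recon's actual implementation: the class name TaskStatusGauges, its methods, and the metric name recon_task_last_run_status are all hypothetical, and treating a task that has never run as 0 is an assumption.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical holder for per-task lastRunStatus values; not a Recon class. */
class TaskStatusGauges {
  static final int SUCCESS = 0;   // last run succeeded (encoding from the description)
  static final int FAILURE = -1;  // last run failed (encoding from the description)

  private final Map<String, Integer> lastRunStatus = new ConcurrentHashMap<>();

  /** Record the outcome of a finished task run. */
  void recordRun(String taskName, boolean succeeded) {
    lastRunStatus.put(taskName, succeeded ? SUCCESS : FAILURE);
  }

  /** Value to export; a task that has never run reports SUCCESS (an assumption). */
  int getLastRunStatus(String taskName) {
    return lastRunStatus.getOrDefault(taskName, SUCCESS);
  }

  /** Render one sample line in Prometheus text format (metric name is hypothetical). */
  String toPrometheusLine(String taskName) {
    return "recon_task_last_run_status{task=\"" + taskName + "\"} "
        + getLastRunStatus(taskName);
  }
}
```

With this encoding, an alert only needs to check whether the exported value for a task is below 0.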

 

Changes to the *RECON_TASK_STATUS* table:
 * Update the *last_updated_seq_num* value for all tasks in the 
*RECON_TASK_STATUS* table.
 * Add a new column, *last_run_task_status*, to the *RECON_TASK_STATUS* table 
to track the status of the last run.
 * Add another column, *current_task_status*, to the *RECON_TASK_STATUS* table 
to track the status of the task's current run.
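
How the two new columns might change over a task's lifecycle can be sketched as follows. This is an illustration under stated assumptions, not Recon's schema code: the class ReconTaskStatusRow and its methods are hypothetical, the 1 = running / 0 = idle encoding for current_task_status is assumed (the description does not define it), and updating last_updated_seq_num when a run finishes is one possible reading of the first bullet.

```java
/** Hypothetical in-memory view of one RECON_TASK_STATUS row; not a Recon class. */
final class ReconTaskStatusRow {
  static final int RUNNING = 1; // assumed encoding for current_task_status
  static final int IDLE = 0;    // assumed encoding for current_task_status

  final String taskName;
  final long lastUpdatedSeqNum;   // last_updated_seq_num
  final int lastRunTaskStatus;    // last_run_task_status: 0 success, -1 failure
  final int currentTaskStatus;    // current_task_status

  ReconTaskStatusRow(String taskName, long lastUpdatedSeqNum,
                     int lastRunTaskStatus, int currentTaskStatus) {
    this.taskName = taskName;
    this.lastUpdatedSeqNum = lastUpdatedSeqNum;
    this.lastRunTaskStatus = lastRunTaskStatus;
    this.currentTaskStatus = currentTaskStatus;
  }

  /** Mark the task as running; the previous run's status stays untouched. */
  ReconTaskStatusRow start() {
    return new ReconTaskStatusRow(taskName, lastUpdatedSeqNum,
        lastRunTaskStatus, RUNNING);
  }

  /** Record the outcome and sequence number of the run, then mark the task idle. */
  ReconTaskStatusRow finish(boolean succeeded, long seqNum) {
    return new ReconTaskStatusRow(taskName, seqNum,
        succeeded ? 0 : -1, IDLE);
  }
}
```

Keeping the two statuses in separate columns lets a reader distinguish "the last run failed" from "a new run is in progress".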

  was:
For SCM metadata background tasks, we can leverage the ICR and FCR metrics 
exposed by org.apache.hadoop.hdds.server.events.FixedThreadPoolWithAffinityExecutor, 
which is used for both ICR and FCR.

!image-2024-11-25-14-19-58-883.png|width=808,height=146!

So if any ICR/FCR events are still in the queue, we know that container reports 
are yet to be processed and that ContainerHealthTask and PipelineSyncTask may 
not be showing up-to-date data.

Similar queue metrics are available for pipeline reports as well.

!image-2024-11-25-14-26-32-614.png|width=724,height=158!

Add new Prometheus metrics for improved observability in Recon:

For both OM and SCM metadata background tasks:
 * lastRunStatus
 ** 0 - the last run of the task succeeded
 ** -1 - the last run of the task failed

 

Changes to the *RECON_TASK_STATUS* table:
 * Update the *last_updated_seq_num* value for all tasks in the 
*RECON_TASK_STATUS* table.
 * Add a new column, *task_status*, to the *RECON_TASK_STATUS* table to track 
the status of the last run.
 * Add a new map from task name to the number of failed and successful runs. 
This map is maintained for a configurable period; once the period is over, 
reinitialize the map and start storing new pass/fail counts.
 * Add a configuration value for this timeout duration, with a default of 
30 minutes.


> Ozone Recon - Enhance Prometheus Metrics For Improved Observability
> -------------------------------------------------------------------
>
>                 Key: HDDS-11680
>                 URL: https://issues.apache.org/jira/browse/HDDS-11680
>             Project: Apache Ozone
>          Issue Type: Task
>          Components: Ozone Recon
>            Reporter: Devesh Kumar Singh
>            Assignee: Abhishek Pal
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: image-2024-11-25-14-19-56-217.png, 
> image-2024-11-25-14-19-58-883.png, image-2024-11-25-14-26-32-614.png
>
>
> For SCM metadata background tasks, we can leverage the ICR and FCR metrics 
> exposed by 
> org.apache.hadoop.hdds.server.events.FixedThreadPoolWithAffinityExecutor, 
> which is used for both ICR and FCR.
> !image-2024-11-25-14-19-58-883.png|width=808,height=146!
> So if any ICR/FCR events are still in the queue, we know that container 
> reports are yet to be processed and that ContainerHealthTask and 
> PipelineSyncTask may not be showing up-to-date data.
> Similar queue metrics are available for pipeline reports as well.
> !image-2024-11-25-14-26-32-614.png|width=724,height=158!
> Add new Prometheus metrics for improved observability in Recon:
> For both OM and SCM metadata background tasks:
>  * lastRunStatus
>  ** 0 - the last run of the task succeeded
>  ** -1 - the last run of the task failed
>  
> Changes to the *RECON_TASK_STATUS* table:
>  * Update the *last_updated_seq_num* value for all tasks in the 
> *RECON_TASK_STATUS* table.
>  * Add a new column, *last_run_task_status*, to the *RECON_TASK_STATUS* 
> table to track the status of the last run.
>  * Add another column, *current_task_status*, to the *RECON_TASK_STATUS* 
> table to track the status of the task's current run.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
