[ 
https://issues.apache.org/jira/browse/HDDS-11680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HDDS-11680:
----------------------------------
    Labels: pull-request-available  (was: )

> Ozone Recon - Enhance Prometheus Metrics For Improved Observability
> -------------------------------------------------------------------
>
>                 Key: HDDS-11680
>                 URL: https://issues.apache.org/jira/browse/HDDS-11680
>             Project: Apache Ozone
>          Issue Type: Task
>          Components: Ozone Recon
>            Reporter: Devesh Kumar Singh
>            Assignee: Abhishek Pal
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: image-2024-11-25-14-19-56-217.png, 
> image-2024-11-25-14-19-58-883.png, image-2024-11-25-14-26-32-614.png
>
>
> For SCM metadata background tasks, we can leverage the ICR (incremental 
> container report) and FCR (full container report) metrics exposed by 
> org.apache.hadoop.hdds.server.events.FixedThreadPoolWithAffinityExecutor, 
> which is used for both ICR and FCR.
> !image-2024-11-25-14-19-58-883.png|width=808,height=146!
> So if any ICR/FCR events are still in the queue, we know that container 
> reports are yet to be processed, and ContainerHealthTask and PipelineSyncTask 
> may not be showing up-to-date data.
> Similar queue metrics are available for pipeline reports as well.
> !image-2024-11-25-14-26-32-614.png|width=724,height=158!
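
The queue-based reasoning above can be sketched as a small check. This is only an illustration: the class name, the method name and the map of queued counts are hypothetical and are not Recon or SCM APIs. It assumes the queued-event counts have already been read from the executor metrics.

```java
import java.util.Map;

/** Hypothetical helper; not part of Ozone. */
final class ReportBacklogCheck {
  private ReportBacklogCheck() { }

  /**
   * Returns true when any report queue (for example ICR, FCR or pipeline
   * reports) still holds unprocessed events, which means task output
   * derived from those reports may lag behind the cluster state.
   */
  static boolean mayBeStale(Map<String, Long> queuedEventsByType) {
    for (long queued : queuedEventsByType.values()) {
      if (queued > 0) {
        return true;
      }
    }
    return false;
  }
}
```

A caller would pass the current queue sizes keyed by report type and treat a `true` result as a hint that ContainerHealthTask or PipelineSyncTask data may be out of date.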
> Add new Prometheus metrics for improved observability in Recon.
> For both OM and SCM metadata background tasks:
>  * lastRunStatus
>  ** 0 - the last run of the task succeeded
>  ** -1 - the last run of the task failed
>  
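
A minimal sketch of how the lastRunStatus values described above might be tracked per task. The class and method names are hypothetical and this is not the Recon implementation. How the value is registered with the metrics system is left out, and a task that has never run returns an empty result because the description does not define a value for that case.

```java
import java.util.Map;
import java.util.OptionalInt;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical tracker for the lastRunStatus value of each task. */
final class TaskRunStatusTracker {
  static final int SUCCESS = 0;   // last run succeeded
  static final int FAILURE = -1;  // last run failed

  private final Map<String, Integer> lastRunStatus = new ConcurrentHashMap<>();

  /** Records the outcome of the most recent run of the given task. */
  void recordRun(String taskName, boolean succeeded) {
    lastRunStatus.put(taskName, succeeded ? SUCCESS : FAILURE);
  }

  /** Value a gauge could expose; empty if the task has not run yet. */
  OptionalInt getLastRunStatus(String taskName) {
    Integer status = lastRunStatus.get(taskName);
    return status == null ? OptionalInt.empty() : OptionalInt.of(status);
  }
}
```

Each task would call `recordRun` at the end of a run, and a gauge named lastRunStatus would read the stored value on scrape.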
> Changes to the *RECON_TASK_STATUS* table and related task tracking:
>  * Update the *last_updated_seq_num* value for all tasks in the 
> RECON_TASK_STATUS table.
>  * Add a new column, *task_status*, to the *RECON_TASK_STATUS* table to 
> track the status of the last run.
>  * Add a new map from task name to the number of failures and passes. This 
> map will be maintained for a configurable amount of time. Once that period 
> is over, reinitialize the map and start storing new pass/fail counts.
>  * Add a configuration value to store this timeout duration, with a default 
> of 30 minutes.
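
The pass/fail map with a timed reset described above could look roughly like the following. This is a sketch under assumptions: the class name, the injected clock and the reset-on-access strategy are all hypothetical choices, and the actual change may use a scheduled reset or a different data structure. Only the 30-minute default comes from the description.

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongSupplier;

/** Hypothetical per-task pass/fail counters that reset after a fixed window. */
final class WindowedTaskCounters {
  static final Duration DEFAULT_WINDOW = Duration.ofMinutes(30);

  private final long windowMillis;
  private final LongSupplier clockMillis;  // injected so tests can control time
  private final Map<String, long[]> counts = new HashMap<>();  // [0]=passes, [1]=failures
  private long windowStartMillis;

  WindowedTaskCounters(Duration window, LongSupplier clockMillis) {
    this.windowMillis = window.toMillis();
    this.clockMillis = clockMillis;
    this.windowStartMillis = clockMillis.getAsLong();
  }

  synchronized void record(String taskName, boolean succeeded) {
    resetIfExpired();
    long[] c = counts.computeIfAbsent(taskName, k -> new long[2]);
    c[succeeded ? 0 : 1]++;
  }

  synchronized long passes(String taskName) {
    resetIfExpired();
    long[] c = counts.get(taskName);
    return c == null ? 0 : c[0];
  }

  synchronized long failures(String taskName) {
    resetIfExpired();
    long[] c = counts.get(taskName);
    return c == null ? 0 : c[1];
  }

  /** Clears all counts and starts a new window once the current one has elapsed. */
  private void resetIfExpired() {
    long now = clockMillis.getAsLong();
    if (now - windowStartMillis >= windowMillis) {
      counts.clear();
      windowStartMillis = now;
    }
  }
}
```

In production the clock argument would typically be `System::currentTimeMillis`, and the window would be read from the new configuration value instead of `DEFAULT_WINDOW`.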



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
