[
https://issues.apache.org/jira/browse/HDDS-11680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated HDDS-11680:
----------------------------------
Labels: pull-request-available (was: )
> Ozone Recon - Enhance Prometheus Metrics For Improved Observability
> -------------------------------------------------------------------
>
> Key: HDDS-11680
> URL: https://issues.apache.org/jira/browse/HDDS-11680
> Project: Apache Ozone
> Issue Type: Task
> Components: Ozone Recon
> Reporter: Devesh Kumar Singh
> Assignee: Abhishek Pal
> Priority: Major
> Labels: pull-request-available
> Attachments: image-2024-11-25-14-19-56-217.png,
> image-2024-11-25-14-19-58-883.png, image-2024-11-25-14-26-32-614.png
>
>
> For SCM metadata background tasks, we can leverage the ICR (incremental
> container report) and FCR (full container report) metrics exposed by
> org.apache.hadoop.hdds.server.events.FixedThreadPoolWithAffinityExecutor.
> This executor is used for both ICR and FCR.
> !image-2024-11-25-14-19-58-883.png|width=808,height=146!
> So if any ICR/FCR events are still queued, we know that container reports
> remain to be processed, and ContainerHealthTask and PipelineSyncTask may not
> be showing up-to-date data.
> Similar metrics are available for pipeline reports as well.
> !image-2024-11-25-14-26-32-614.png|width=724,height=158!
> Add new Prometheus metrics for improved observability in Recon.
> For both OM and SCM metadata background tasks:
> * lastRunStatus
> ** 0 - the last run of the task succeeded
> ** -1 - the last run of the task failed
>
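
A minimal sketch of how the lastRunStatus values described above might be tracked per task. This is plain Java, not Recon's actual metrics code; the class name TaskStatusGauges and its methods are hypothetical, and only the 0 / -1 encoding comes from the description.

```java
import java.util.Map;
import java.util.OptionalInt;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical holder for per-task lastRunStatus values (illustration only). */
final class TaskStatusGauges {
  // Encoding taken from the issue description.
  static final int SUCCESS = 0;
  static final int FAILURE = -1;

  // Task name -> status of the most recent run.
  private final Map<String, Integer> lastRunStatus = new ConcurrentHashMap<>();

  /** Records the outcome of the latest run, overwriting any earlier value. */
  void recordRun(String taskName, boolean succeeded) {
    lastRunStatus.put(taskName, succeeded ? SUCCESS : FAILURE);
  }

  /** Returns the last recorded status, or empty if the task has not run yet. */
  OptionalInt getLastRunStatus(String taskName) {
    Integer status = lastRunStatus.get(taskName);
    return status == null ? OptionalInt.empty() : OptionalInt.of(status);
  }
}
```

A task that has never run reports no value here instead of 0, so a dashboard would not mistake "not run yet" for "succeeded"; whether Recon should behave that way is an open design choice, not something the description specifies.
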
> Changes to the *RECON_TASK_STATUS* table:
> * Update the *last_updated_seq_num* value for all tasks in the
> RECON_TASK_STATUS table.
> * Add a new column, *task_status*, to the *RECON_TASK_STATUS* table
> to track the status of the last run.
> * Add a new map from task name to the number of failures and passes.
> This map will be maintained for a configurable amount of time. After the time
> period is over, reinitialize the map and start storing new pass/fail counts.
> * Add a configuration value to store this timeout duration, with a default of
> 30 minutes.
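
The pass/fail map with a configurable reset period could be sketched roughly as follows. This assumes a simple fixed window that is checked lazily on each access; the class TaskRunCounters, its nested Counts type, and the injected clock are all hypothetical names for illustration, and the actual configuration key is not named in the description, so the window is passed in as a Duration.

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongSupplier;

/** Hypothetical per-task pass/fail counters that reset after a fixed window. */
final class TaskRunCounters {
  /** Immutable pass/fail totals for one task within the current window. */
  static final class Counts {
    final int passes;
    final int failures;

    Counts(int passes, int failures) {
      this.passes = passes;
      this.failures = failures;
    }
  }

  private final long windowMillis;
  private final LongSupplier clockMillis; // injected so the window is testable
  private final Map<String, Counts> counts = new HashMap<>();
  private long windowStart;

  TaskRunCounters(Duration window, LongSupplier clockMillis) {
    this.windowMillis = window.toMillis();
    this.clockMillis = clockMillis;
    this.windowStart = clockMillis.getAsLong();
  }

  /** Adds one pass or one failure for the task in the current window. */
  synchronized void record(String taskName, boolean succeeded) {
    resetIfExpired();
    Counts c = counts.getOrDefault(taskName, new Counts(0, 0));
    counts.put(taskName, succeeded
        ? new Counts(c.passes + 1, c.failures)
        : new Counts(c.passes, c.failures + 1));
  }

  /** Returns the task's totals for the current window (zeros if none). */
  synchronized Counts get(String taskName) {
    resetIfExpired();
    return counts.getOrDefault(taskName, new Counts(0, 0));
  }

  // Reinitializes the map once the configured period has elapsed.
  private void resetIfExpired() {
    long now = clockMillis.getAsLong();
    if (now - windowStart >= windowMillis) {
      counts.clear();
      windowStart = now;
    }
  }
}
```

With the proposed default, this would be constructed as new TaskRunCounters(Duration.ofMinutes(30), System::currentTimeMillis). The lazy reset avoids a separate timer thread, at the cost of the map lingering past its window until the next read or write.
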
--
This message was sent by Atlassian Jira
(v8.20.10#820010)