[
https://issues.apache.org/jira/browse/HDDS-11680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Devesh Kumar Singh updated HDDS-11680:
--------------------------------------
Description:
For SCM metadata background tasks, we can leverage ICR and FCR metrics exposed
by org.apache.hadoop.hdds.server.events.FixedThreadPoolWithAffinityExecutor.
This executor is used for both ICR and FCR.
!image-2024-11-25-14-19-58-883.png|width=808,height=146!
So if any ICR/FCR events are in the queue, we'll know container reports are
still to be processed and ContainerHealthTask and PipelineSyncTask may not be
showing up-to-date data.
Similar queue metrics are available for pipeline reports as well.
!image-2024-11-25-14-26-32-614.png|width=724,height=158!
Add new Prometheus metrics for improved observability in Recon:
For both OM and SCM metadata background tasks:
* lastRunStatus
** 0 - the last run of the task succeeded.
** -1 - the last run of the task failed.
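As a rough illustration of the encoding above (0 for success, -1 for failure), here is a minimal Java sketch. The class and method names (LastRunStatusTracker, runAndRecord) are hypothetical and are not part of Recon; it only shows how a per-task lastRunStatus value could be recorded before being exposed through whatever metrics system Recon uses.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical holder for the proposed lastRunStatus value of each task.
 * This is an illustration only, not Recon's actual metrics class.
 */
class LastRunStatusTracker {
  static final int SUCCESS = 0;   // encoding taken from the description
  static final int FAILURE = -1;  // encoding taken from the description

  private final Map<String, Integer> lastRunStatus = new ConcurrentHashMap<>();

  /** Runs the task body and records 0 on normal completion, -1 if it throws. */
  void runAndRecord(String taskName, Runnable taskBody) {
    try {
      taskBody.run();
      lastRunStatus.put(taskName, SUCCESS);
    } catch (RuntimeException e) {
      lastRunStatus.put(taskName, FAILURE);
    }
  }

  /** Returns the last recorded status, or null if the task has never run. */
  Integer get(String taskName) {
    return lastRunStatus.get(taskName);
  }
}
```

A gauge registered per task could then read the value from get(taskName); how that gauge is wired into Prometheus is left open here.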
Changes to the *RECON_TASK_STATUS* table:
* Update the *last_updated_seq_num* value for all tasks in the
RECON_TASK_STATUS table.
* Add a new column, *last_run_task_status*, to the *RECON_TASK_STATUS* table
to track the status of the last run of each task.
* Add another column, *current_task_status*, to the *RECON_TASK_STATUS* table
to track the status of the task's current run.
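To make the proposed columns concrete, here is a small Java sketch of one row and how it might change over a task run. Everything in it is an assumption for illustration: the TaskStatusRow class, the 1 = running / 0 = idle encoding for current_task_status, and the choice to advance last_updated_seq_num only on success are not stated in this issue.

```java
/**
 * Hypothetical in-memory view of one RECON_TASK_STATUS row.
 * Field names mirror the columns named in the description; the status
 * encodings and update rules are assumptions, not Recon's schema.
 */
final class TaskStatusRow {
  final String taskName;
  final long lastUpdatedSeqNum;
  final int lastRunTaskStatus;  // assumed: 0 = success, -1 = failure
  final int currentTaskStatus;  // assumed: 1 = running, 0 = idle

  TaskStatusRow(String taskName, long lastUpdatedSeqNum,
                int lastRunTaskStatus, int currentTaskStatus) {
    this.taskName = taskName;
    this.lastUpdatedSeqNum = lastUpdatedSeqNum;
    this.lastRunTaskStatus = lastRunTaskStatus;
    this.currentTaskStatus = currentTaskStatus;
  }

  /** Marks the task as running; the last-run fields are left untouched. */
  TaskStatusRow started() {
    return new TaskStatusRow(taskName, lastUpdatedSeqNum, lastRunTaskStatus, 1);
  }

  /**
   * Marks the run as finished. The sequence number advances only on success
   * (an assumption), so a failed run keeps the previous value.
   */
  TaskStatusRow finished(boolean success, long processedSeqNum) {
    long seq = success ? processedSeqNum : lastUpdatedSeqNum;
    return new TaskStatusRow(taskName, seq, success ? 0 : -1, 0);
  }
}
```

With this shape, a reader of the table can tell a task that is still running apart from one whose last run failed, which is the distinction the two new columns are meant to provide.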
was:
For SCM metadata background tasks, we can leverage ICR and FCR metrics exposed
by org.apache.hadoop.hdds.server.events.FixedThreadPoolWithAffinityExecutor.
This executor is used for both ICR and FCR.
!image-2024-11-25-14-19-58-883.png|width=808,height=146!
So if any ICR/FCR events are in the queue, we'll know container reports are
still to be processed and ContainerHealthTask and PipelineSyncTask may not be
showing up-to-date data.
Similar queue metrics are available for pipeline reports as well.
!image-2024-11-25-14-26-32-614.png|width=724,height=158!
Add new Prometheus metrics for improved observability in Recon:
For both OM and SCM metadata background tasks:
* lastRunStatus
** 0 - the last run of the task succeeded.
** -1 - the last run of the task failed.
Changes/modification in *RECON_TASK_STATUS* table:
* Update the *last_updated_seq_num* value for all tasks in the
RECON_TASK_STATUS table.
* Add a new column, *task_status*, to the *RECON_TASK_STATUS* table to
track the status of the last run of each task.
* Add a new map from task name to the number of failures and passes. This
map will be maintained for a configurable amount of time. After the time period
is over, reinitialize the map and start storing new pass/fail counts.
* Add a configuration value to store this timeout duration, with a default of
30 minutes.
> Ozone Recon - Enhance Prometheus Metrics For Improved Observability
> -------------------------------------------------------------------
>
> Key: HDDS-11680
> URL: https://issues.apache.org/jira/browse/HDDS-11680
> Project: Apache Ozone
> Issue Type: Task
> Components: Ozone Recon
> Reporter: Devesh Kumar Singh
> Assignee: Abhishek Pal
> Priority: Major
> Labels: pull-request-available
> Attachments: image-2024-11-25-14-19-56-217.png,
> image-2024-11-25-14-19-58-883.png, image-2024-11-25-14-26-32-614.png
>