[ 
https://issues.apache.org/jira/browse/HDDS-11680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abhishek Pal updated HDDS-11680:
--------------------------------
    Description: 
For SCM metadata background tasks, we can leverage the ICR and FCR metrics 
exposed by org.apache.hadoop.hdds.server.events.FixedThreadPoolWithAffinityExecutor. 
This executor is used for both ICR and FCR.

!image-2024-11-25-14-19-58-883.png|width=808,height=146!

If any ICR/FCR events are still in the queue, container reports remain to be 
processed, so ContainerHealthTask and PipelineSyncTask may not be showing 
up-to-date data.

Similar metrics are available for pipeline reports as well.

!image-2024-11-25-14-26-32-614.png|width=724,height=158!

Add new Prometheus metrics for improved observability in Recon:

For both OM and SCM metadata background tasks:
 * lastRunStatus
 ** 0 - the last run of the task succeeded
 ** -1 - the last run of the task failed
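
A minimal Java sketch of how the lastRunStatus value could be tracked per task, for illustration only. The class TaskRunStatus and its methods are hypothetical names, not existing Recon code, and the wiring into Recon's actual metrics system is deliberately left out.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical holder for per-task lastRunStatus values (not Recon code). */
class TaskRunStatus {
  static final int SUCCESS = 0;   // last run succeeded
  static final int FAILURE = -1;  // last run failed

  private final Map<String, Integer> lastRunStatus = new ConcurrentHashMap<>();

  /** Runs the task body and records 0 on success, -1 if it throws. */
  void runAndRecord(String taskName, Runnable task) {
    try {
      task.run();
      lastRunStatus.put(taskName, SUCCESS);
    } catch (RuntimeException e) {
      lastRunStatus.put(taskName, FAILURE);
    }
  }

  /** Value a metrics exporter would publish; null if the task never ran. */
  Integer getLastRunStatus(String taskName) {
    return lastRunStatus.get(taskName);
  }
}
```

Each run overwrites the previous value, so the gauge only ever reflects the most recent run of a task.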

 

Changes to the *RECON_TASK_STATUS* table:
 * Update the *last_updated_seq_num* value for all tasks in the 
RECON_TASK_STATUS table.
 * Add a new column, “{*}task_status{*}”, to the *RECON_TASK_STATUS* table to 
track the status of the last run.
 * Add a new map from task name to the number of failures and passes. This 
map is maintained for a configurable amount of time. Once that period is over, 
reinitialize the map and start storing new pass/fail counts.
 * Add a configuration value for this timeout duration, with a default of 
30 minutes.
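
The pass/fail map described above could be sketched in Java roughly as follows. This is an assumption-laden illustration: TaskRunCounts, its methods, and the injected millisecond clock are hypothetical names, and the real configuration key for the window is not specified in this issue.

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongSupplier;

/** Hypothetical windowed pass/fail counter per task (not Recon code). */
class TaskRunCounts {
  /** Default window from the issue description: 30 minutes. */
  static final Duration DEFAULT_WINDOW = Duration.ofMinutes(30);

  // Value layout: index 0 = passes, index 1 = failures.
  private final Map<String, long[]> counts = new HashMap<>();
  private final long windowMillis;
  private final LongSupplier clockMillis;  // injected so tests can control time
  private long windowStart;

  TaskRunCounts(Duration window, LongSupplier clockMillis) {
    this.windowMillis = window.toMillis();
    this.clockMillis = clockMillis;
    this.windowStart = clockMillis.getAsLong();
  }

  /** Records one run result, starting a fresh window first if needed. */
  synchronized void record(String taskName, boolean success) {
    resetIfExpired();
    long[] c = counts.computeIfAbsent(taskName, k -> new long[2]);
    c[success ? 0 : 1]++;
  }

  synchronized long passes(String taskName) {
    resetIfExpired();
    long[] c = counts.get(taskName);
    return c == null ? 0 : c[0];
  }

  synchronized long failures(String taskName) {
    resetIfExpired();
    long[] c = counts.get(taskName);
    return c == null ? 0 : c[1];
  }

  /** Clears all counts once the configured window has elapsed. */
  private void resetIfExpired() {
    long now = clockMillis.getAsLong();
    if (now - windowStart >= windowMillis) {
      counts.clear();
      windowStart = now;
    }
  }
}
```

In production the clock would typically be System::currentTimeMillis. The sketch resets lazily on access rather than with a scheduled timer, which is one possible design choice, not necessarily the one the implementation will take.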

  was:
For SCM metadata background tasks, we can leverage the ICR and FCR metrics 
exposed by org.apache.hadoop.hdds.server.events.FixedThreadPoolWithAffinityExecutor. 
This executor is used for both ICR and FCR.

!image-2024-11-25-14-19-58-883.png|width=808,height=146!

If any ICR/FCR events are still in the queue, container reports remain to be 
processed, so ContainerHealthTask and PipelineSyncTask may not be showing 
up-to-date data.

Similar metrics are available for pipeline reports as well.

!image-2024-11-25-14-26-32-614.png|width=724,height=158!

Add new Prometheus metrics for improved observability in Recon:

For both OM and SCM metadata background tasks:
 * lastRunStatus
 ** 0 - the last run of the task succeeded
 ** -1 - the last run of the task failed

 

Changes to the *RECON_TASK_STATUS* table:
 * Update the *last_updated_seq_num* value for all tasks in the 
RECON_TASK_STATUS table.
 * Add a new column, “{*}task_id{*}”, to the *RECON_TASK_STATUS* table, set for 
each task instance when it starts. This lets us track each task run separately 
instead of updating and overwriting the same task run record.
 * Add a new column, “{*}task_status{*}”, to the *RECON_TASK_STATUS* table to 
track the status of each run. By tracking each run, we can show useful 
information about each task, such as how many times it failed or succeeded in 
the last X units of time.


> Ozone Recon - Enhance Prometheus Metrics For Improved Observability
> -------------------------------------------------------------------
>
>                 Key: HDDS-11680
>                 URL: https://issues.apache.org/jira/browse/HDDS-11680
>             Project: Apache Ozone
>          Issue Type: Task
>          Components: Ozone Recon
>            Reporter: Devesh Kumar Singh
>            Assignee: Abhishek Pal
>            Priority: Major
>         Attachments: image-2024-11-25-14-19-56-217.png, 
> image-2024-11-25-14-19-58-883.png, image-2024-11-25-14-26-32-614.png
>


