[jira] [Updated] (HDDS-11680) Ozone Recon - Enhance Prometheus Metrics For Improved Observability

Devesh Kumar Singh (Jira) Mon, 25 Nov 2024 00:57:05 -0800


     [ 
https://issues.apache.org/jira/browse/HDDS-11680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Devesh Kumar Singh updated HDDS-11680:
--------------------------------------
    Description: 
For SCM metadata background tasks, we can leverage the incremental container 
report (ICR) and full container report (FCR) metrics exposed by 
org.apache.hadoop.hdds.server.events.FixedThreadPoolWithAffinityExecutor, 
which is used for both ICR and FCR.

!image-2024-11-25-14-19-58-883.png|width=808,height=146!

So if any ICR/FCR events are still queued, we know that container reports 
remain to be processed and that ContainerHealthTask and PipelineSyncTask may 
not be showing up-to-date data.

Similar metrics are available for pipeline reports as well.

!image-2024-11-25-14-26-32-614.png|width=724,height=158!
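
The staleness reasoning above can be sketched as a small check. This is a minimal illustration only: the class name, method name, and the map of queue sizes are hypothetical and not part of Ozone. How the queued-event counts are read from the executor's metrics is left out.

```java
import java.util.Map;

/** Hypothetical helper; names are illustrative and not part of Ozone. */
final class ReportBacklogCheck {
  private ReportBacklogCheck() { }

  /**
   * Returns true when any report queue (for example ICR, FCR, or pipeline
   * reports) still holds unprocessed events, meaning tasks that depend on
   * those reports may be showing stale data.
   */
  static boolean mayBeStale(Map<String, Long> queuedEventsByType) {
    for (long queued : queuedEventsByType.values()) {
      if (queued > 0) {
        return true;
      }
    }
    return false;
  }
}
```

A caller would pass the current queue sizes keyed by report type and treat a true result as a hint that ContainerHealthTask or PipelineSyncTask output may lag behind.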

Add new Prometheus metrics for improved observability in Recon:

For both OM and SCM metadata background tasks:
 * lastRunStatus
 ** 0 - the last task run succeeded.
 ** -1 - the last task run failed.
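
As a rough sketch of how a lastRunStatus value could be maintained for one task, the following plain-Java holder records 0 or -1 after each run. The class and method names are hypothetical, the initial value of 0 before any run is an assumption, and wiring the value into Recon's metrics system is not shown.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical holder for the proposed lastRunStatus value of one task. */
final class TaskRunStatusGauge {
  static final int SUCCESS = 0;
  static final int FAILURE = -1;

  // Assumption: reports SUCCESS until the first run completes.
  private final AtomicInteger lastRunStatus = new AtomicInteger(SUCCESS);

  /** Runs the task body and records 0 on success or -1 if it throws. */
  void runAndRecord(Runnable task) {
    try {
      task.run();
      lastRunStatus.set(SUCCESS);
    } catch (RuntimeException e) {
      lastRunStatus.set(FAILURE);
    }
  }

  /** Value a metrics exporter would publish for this task. */
  int getLastRunStatus() {
    return lastRunStatus.get();
  }
}
```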

 

Changes to the *RECON_TASK_STATUS* table:
 * Update the *last_updated_seq_num* value for all tasks in the 
RECON_TASK_STATUS table.
 * Add a new column, “{*}task_id{*}”, to the *RECON_TASK_STATUS* table, set for 
each task instance when it starts. This lets us track each task run 
separately and tell runs apart, instead of updating and overwriting the same 
task run entry.
 * Add a new column, “{*}task_status{*}”, to the *RECON_TASK_STATUS* table to 
track the status of each run. With per-run tracking, we can show useful 
information about each task, such as how many times it failed or succeeded in 
the last X units of time.
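
To illustrate the kind of query that per-run tracking enables, here is a minimal in-memory sketch. It is not the Recon schema or its data access code: the class, the timestamp field, and the reuse of the 0 / -1 encoding for task_status are all assumptions made for the example.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical in-memory stand-in for per-run rows in RECON_TASK_STATUS. */
final class TaskRunHistory {
  /** One entry per task run; fields loosely mirror the proposed columns. */
  static final class Run {
    final long taskId;        // proposed task_id column
    final String taskName;
    final int taskStatus;     // proposed task_status column (assumed: 0 ok, -1 failed)
    final Instant finishedAt; // assumed timestamp, not named in the proposal

    Run(long taskId, String taskName, int taskStatus, Instant finishedAt) {
      this.taskId = taskId;
      this.taskName = taskName;
      this.taskStatus = taskStatus;
      this.finishedAt = finishedAt;
    }
  }

  private final List<Run> runs = new ArrayList<>();
  private long nextTaskId = 1;

  /** Appends a new run instead of overwriting the previous one. */
  long record(String taskName, int taskStatus, Instant finishedAt) {
    long id = nextTaskId++;
    runs.add(new Run(id, taskName, taskStatus, finishedAt));
    return id;
  }

  /** Counts failed runs of a task that finished within the given window. */
  long countFailures(String taskName, Instant now, Duration window) {
    Instant cutoff = now.minus(window);
    long failures = 0;
    for (Run run : runs) {
      if (run.taskName.equals(taskName)
          && run.taskStatus == -1
          && !run.finishedAt.isBefore(cutoff)) {
        failures++;
      }
    }
    return failures;
  }
}
```

Because every run gets its own entry, the same history could also answer success counts or the time of the last failure without any extra bookkeeping.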

  was:
Add new prometheus metrics for improved observability in Recon:

For OM metadata background tasks:
 * numRemainingEventsToProcess \{type="ContainerHealthTask"}
 * numRemainingEventsToProcess \{type="FileSizeCountTask"}
 * numRemainingEventsToProcess \{type="OmTableInsightTask"}
 * numRemainingEventsToProcess \{type="NSSummaryTask"}

For SCM metadata background tasks:
 * lastSuccessfullyProcessedContainerReportTS
 * lastSuccessfullyProcessedPipelineReportTS

 

For both OM and SCM metadata background tasks:
 * lastRunStatus
 ** 0 - last run task status was success
 ** -1 - last run task status was fail.


> Ozone Recon - Enhance Prometheus Metrics For Improved Observability
> -------------------------------------------------------------------
>
>                 Key: HDDS-11680
>                 URL: https://issues.apache.org/jira/browse/HDDS-11680
>             Project: Apache Ozone
>          Issue Type: Task
>          Components: Ozone Recon
>            Reporter: Devesh Kumar Singh
>            Assignee: Abhishek Pal
>            Priority: Major
>         Attachments: image-2024-11-25-14-19-56-217.png, 
> image-2024-11-25-14-19-58-883.png, image-2024-11-25-14-26-32-614.png
>
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
