[jira] [Commented] (HDDS-11511) OM deletion services should have consistent metrics

Ethan Rose (Jira) Mon, 25 Nov 2024 10:01:24 -0800


    [ 
https://issues.apache.org/jira/browse/HDDS-11511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17900984#comment-17900984
 ]


Ethan Rose commented on HDDS-11511:
-----------------------------------

{quote}Charting is done via prometheus/grafana for incrementing at every 
iteration. So it's not required to be iteration level metrics. These tools can 
easily handle this. We do not need iteration metrics for HDDS-11512.
{quote}
If I'm understanding correctly, you are saying that we should continuously 
update the metrics as the services are running, and let the prometheus/grafana 
sync mechanisms determine the time increments. All our configurations are based 
on iteration interval and deletions per iteration, so we do need some 
iteration-based information to accurately tune configs from the dashboards. I 
see a few implementations we could go with:
 # Metrics are continuously incremented as the service is running and are never 
reset until restart.
 # Metrics are incremented in a batch at the end of each run of the service and 
are never reset until restart.
 # Metrics are reset in a batch at the end of each service run.

I'm not sure exactly which approach you are referring to, but approach 1 will 
make our iteration-based configuration tuning difficult. Approach 2 will work 
as long as the metrics are sampled more frequently than the deletion services 
run, which I imagine would usually be the case. However, it is only useful with 
a dashboarding setup that subtracts the previous sample from the current one to 
show the delta for the last iteration. Approach 3 would work in any scenario. 
For continuous metrics like ops/sec or network bandwidth used, our only option 
is continuous increments sampled at intervals. Services that run on a schedule 
are different though, and we should probably define a standard across Ozone for 
how we want to report these.
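
To make the three options concrete, here is a rough sketch of how each one 
would update a "deleted items" metric. The class and method names are 
hypothetical and plain AtomicLong fields stand in for the real metrics classes, 
so this is not the existing Ozone code.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch of the three reporting options; not an Ozone class. */
final class DeletionMetricsSketch {
  // Options 1 and 2: a counter that only grows and is cleared on restart.
  private final AtomicLong totalDeleted = new AtomicLong();
  // Option 3: holds only the count from the most recent run.
  private final AtomicLong deletedInLastRun = new AtomicLong();

  /** Option 1: called for every deleted item while a run is in progress. */
  void onItemDeleted() {
    totalDeleted.incrementAndGet();
  }

  /** Option 2: called once at the end of a run with that run's count. */
  void onRunFinishedCumulative(long deletedThisRun) {
    totalDeleted.addAndGet(deletedThisRun);
  }

  /** Option 3: called once at the end of a run, replacing the old value. */
  void onRunFinishedLastRun(long deletedThisRun) {
    deletedInLastRun.set(deletedThisRun);
  }

  long getTotalDeleted() {
    return totalDeleted.get();
  }

  long getDeletedInLastRun() {
    return deletedInLastRun.get();
  }
}
```

With option 2, a dashboard has to compute getTotalDeleted() at the current 
sample minus getTotalDeleted() at the previous sample to recover the per-run 
count. With option 3, getDeletedInLastRun() can be charted directly.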
{quote}If last iteration information is required, can be obtained from logs.
{quote}
The goal of this PR is to let us go from dashboards to config tuning without 
checking the logs. Ideally, logs would be treated as a last resort for more 
complex issues where dashboards are not effective.
{quote}Further, with multi-threaded, it will be more complicated to define 
iteration, and merge all metrics including all threads.
{quote}
Again, all our configurations are per iteration, which is necessary to prevent 
the services from constantly running. Therefore, the multi-threading PRs will 
need to define what constitutes an iteration, and we can reuse that definition 
here. The way I see this working is that each iteration would use the specified 
number of worker threads to reach its configured iteration limit, and once all 
work is done we would publish one aggregate set of metrics for the run. We 
could optionally add per-thread metrics in addition to that, but they would be 
for lower-level debugging.
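
A minimal sketch of that iteration model, assuming a shared per-iteration 
budget that the worker threads draw from. All names here (IterationSketch, 
runIteration, and the parameters) are made up for illustration, and the 
"delete" step is a placeholder.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch of one multi-threaded iteration; not an Ozone class. */
final class IterationSketch {
  // Aggregate metrics for the most recent run, published once per iteration.
  private final AtomicLong deletedInLastRun = new AtomicLong();
  private final AtomicLong lastRunMillis = new AtomicLong();

  void runIteration(int workerThreads, long iterationLimit, long pendingItems) {
    long start = System.nanoTime();
    // Shared budget: the iteration limit, or less if fewer items are pending.
    AtomicLong budget = new AtomicLong(Math.min(iterationLimit, pendingItems));
    ExecutorService pool = Executors.newFixedThreadPool(workerThreads);
    try {
      List<Callable<Long>> tasks = new ArrayList<>();
      for (int i = 0; i < workerThreads; i++) {
        tasks.add(() -> {
          long handled = 0;
          // Each successful claim takes one unit of the shared budget.
          while (budget.getAndDecrement() > 0) {
            handled++; // placeholder for deleting one item
          }
          return handled; // could also feed an optional per-thread metric
        });
      }
      long total = 0;
      // invokeAll returns only after every worker has finished.
      for (Future<Long> result : pool.invokeAll(tasks)) {
        total += result.get();
      }
      // Publish one aggregate set of metrics for the whole run.
      deletedInLastRun.set(total);
      lastRunMillis.set((System.nanoTime() - start) / 1_000_000);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    } catch (ExecutionException e) {
      throw new IllegalStateException(e);
    } finally {
      pool.shutdown();
    }
  }

  long getDeletedInLastRun() {
    return deletedInLastRun.get();
  }

  long getLastRunMillis() {
    return lastRunMillis.get();
  }
}
```

The iteration boundary here is simply the point where all workers have 
returned, so the aggregate metrics do not depend on how the work was split 
across threads.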

I think a design doc outlining how all of these can work together would help 
since it seems like there's confusion both here and in some of the existing 
PRs. Let me work on that to facilitate a clearer discussion.

> OM deletion services should have consistent metrics
> ---------------------------------------------------
>
>                 Key: HDDS-11511
>                 URL: https://issues.apache.org/jira/browse/HDDS-11511
>             Project: Apache Ozone
>          Issue Type: Sub-task
>            Reporter: Ethan Rose
>            Assignee: Tejaskriya Madhan
>            Priority: Major
>              Labels: pull-request-available
>
> All background deletion services in Ozone should publish the same set of 
> metrics for each thread:
> * Number of items handled in the last iteration
> ** For OM's directory deleting service, handling of files, empty dirs, and 
> non-empty dirs may be tracked with different metrics.
> ** For services where one DB key is multiple blocks (key delete, SCM block 
> transactions), a separate metric should exist for the number of blocks 
> deleted in the iteration and the number of items processed.
> * Time spent in the last iteration
> Some services may already have these metrics, but some specifically in the OM 
> do not. This Jira should review all the services and fill in these metrics 
> where they are missing.


