[
https://issues.apache.org/jira/browse/HDDS-11511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17900984#comment-17900984
]
Ethan Rose commented on HDDS-11511:
-----------------------------------
{quote}Charting is done via prometheus/grafana for incrementing at every
iteration. So its not required to be iteration level metrics. These tool can
easily handle this. We do not need iteration metrics for HDDS-11512.
{quote}
If I'm understanding correctly, you are saying that we should continuously
update the metrics as the services are running, and let the prometheus/grafana
sync mechanisms determine the time increments. All our configurations are based
on iteration interval and deletions per iteration, so we do need some
iteration-based information to accurately tune configs from the dashboards. I
guess there are a few implementations we could go with:
# Metrics are continuously incremented as the service is running and are never
reset until restart.
# Metrics are incremented in a batch at the end of each run of the service and
are never reset until restart.
# Metrics are reset in a batch at the end of each service run.
I'm not sure exactly which approach you are referring to, but option 1 would
make our iteration-based configuration tuning difficult. Option 2 will work as
long as the metrics are sampled more frequently than the deletion services run,
which I imagine would usually be the case. It would not be useful, though,
without a dashboarding setup that subtracts the previous sample to show the
delta from the last iteration. Option 3 would work in any scenario. For
continuous metrics like ops/sec or network bandwidth used, our only option is
continuous increments at sampled intervals. Services that run on a schedule are
different though, and we should probably define a standard across Ozone for how
we want to report these.
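As a rough illustration of the three options above, here is a minimal sketch.
The class and method names are hypothetical and are not Ozone's actual metrics
classes; plain AtomicLong fields stand in for whatever metrics library is used.
It only shows how each option's value evolves across runs, and why option 2
needs a subtraction between samples while option 3 does not.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch only; these are not Ozone's real metrics classes. */
final class DeletionMetricsSketch {
  // Option 1: incremented per item while the service runs, never reset.
  private final AtomicLong continuousTotal = new AtomicLong();
  // Option 2: incremented once at the end of each run, never reset.
  private final AtomicLong batchedTotal = new AtomicLong();
  // Option 3: overwritten at the end of each run with that run's count.
  private final AtomicLong lastIteration = new AtomicLong();

  /** Simulates one run of a deletion service that handles itemsDeleted items. */
  void runIteration(int itemsDeleted) {
    for (int i = 0; i < itemsDeleted; i++) {
      // Option 1: a sample taken mid-run sees a partial count.
      continuousTotal.incrementAndGet();
    }
    // Option 2: samples only ever see whole-run boundaries.
    batchedTotal.addAndGet(itemsDeleted);
    // Option 3: the value is already the per-iteration number.
    lastIteration.set(itemsDeleted);
  }

  long continuousTotal() {
    return continuousTotal.get();
  }

  long batchedTotal() {
    return batchedTotal.get();
  }

  long lastIteration() {
    return lastIteration.get();
  }
}
```

With runs of 100 and then 40 items, options 1 and 2 both end at 140, and the
dashboard has to subtract the previous sample of option 2 (100) to recover 40.
Option 3 reads 40 directly.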
{quote}If last iteration information is required, can be obtained from logs.
{quote}
This PR is to set us up to be able to go from dashboards to config tuning
without checking the logs. Ideally logs would be treated as a last resort for
more complex issues where dashboards are not effective.
{quote}Further, with multi-threaded, it will be more complicated to define
iteration, and merge all metrics including all threads.
{quote}
Again, all our configurations are per iteration, which is necessary to prevent
the services from constantly running. Therefore, multi-threading PRs will need
to define what constitutes an iteration, and we can reuse that definition
here. The way I see this working is that each iteration would use the specified
number of worker threads to reach its configured iteration limit, and once all
work is done we would publish one aggregate set of metrics for the run. We
could optionally add per-thread metrics in addition to that, but they would be
for lower-level debugging.
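To make that concrete, here is a minimal sketch of one multi-threaded
iteration that publishes a single aggregate result. Everything in it is an
assumption for illustration: the class name, the even split of the iteration
limit across workers, and the deleteUpTo placeholder are hypothetical, and the
multi-threading PRs may well define an iteration differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch only; not the design used by any existing PR. */
final class IterationAggregatorSketch {
  private final AtomicLong lastIterationItems = new AtomicLong();
  private final AtomicLong lastIterationMillis = new AtomicLong();

  /** Runs one iteration and publishes one aggregate set of metrics. */
  void runIteration(int workerThreads, int iterationLimit) {
    long startNanos = System.nanoTime();
    ExecutorService pool = Executors.newFixedThreadPool(workerThreads);
    try {
      // Assumption: the iteration limit is split evenly across the workers.
      int share = iterationLimit / workerThreads;
      int remainder = iterationLimit % workerThreads;
      List<Future<Integer>> results = new ArrayList<>();
      for (int t = 0; t < workerThreads; t++) {
        final int quota = share + (t < remainder ? 1 : 0);
        Callable<Integer> task = () -> deleteUpTo(quota);
        results.add(pool.submit(task));
      }
      // Wait for every worker, then merge the per-thread counts.
      long total = 0;
      for (Future<Integer> result : results) {
        total += result.get();
      }
      // Publish only after all work for this iteration is done.
      lastIterationItems.set(total);
      lastIterationMillis.set((System.nanoTime() - startNanos) / 1_000_000);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("iteration interrupted", e);
    } catch (ExecutionException e) {
      throw new IllegalStateException("worker failed", e);
    } finally {
      pool.shutdown();
    }
  }

  /** Placeholder for real deletion work; pretends the whole quota was deleted. */
  private int deleteUpTo(int quota) {
    return quota;
  }

  long lastIterationItems() {
    return lastIterationItems.get();
  }

  long lastIterationMillis() {
    return lastIterationMillis.get();
  }
}
```

Per-thread counts are available inside the loop over results, so optional
per-thread metrics could be recorded there without changing what the
aggregate reports.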
I think a design doc outlining how all of these can work together would help
since it seems like there's confusion both here and in some of the existing
PRs. Let me work on that to facilitate a clearer discussion.
> OM deletion services should have consistent metrics
> ---------------------------------------------------
>
> Key: HDDS-11511
> URL: https://issues.apache.org/jira/browse/HDDS-11511
> Project: Apache Ozone
> Issue Type: Sub-task
> Reporter: Ethan Rose
> Assignee: Tejaskriya Madhan
> Priority: Major
> Labels: pull-request-available
>
> All background deletion services in Ozone should publish the same set of
> metrics for each thread:
> * Number of items handled in the last iteration
> ** For OM's directory deleting service, handling of files, empty dirs, and
> non-empty dirs may be tracked with different metrics.
> ** For services where one DB key is multiple blocks (key delete, SCM block
> transactions), a separate metric should exist for the number of blocks
> deleted in the iteration and the number of items processed.
> * Time spent in the last iteration
> Some services may already have these metrics, but some specifically in the OM
> do not. This Jira should review all the services and fill in these metrics
> where they are missing.