bosswnx opened a new pull request, #63244:
URL: https://github.com/apache/doris/pull/63244
### What problem does this PR solve?
Issue Number: N/A
Related PR: N/A
Problem Summary:
`WorkloadGroupMetrics::refresh_metrics()` computes per-second CPU time and
local/remote scan bytes by dividing the delta of each counter by an interval
derived from `config::workload_group_metrics_interval_ms`:
```cpp
int interval_second = config::workload_group_metrics_interval_ms / 1000;
...
_per_sec_cpu_time_nanos = (_current_cpu_time_nanos - _last_cpu_time_nanos) /
interval_second;
```
This is inaccurate for several reasons:
1. The refresh thread is not guaranteed to fire exactly every
`workload_group_metrics_interval_ms`. Under BE load or scheduling delays
the actual gap between two refreshes can drift significantly, but the
divisor is still the configured interval, so the reported per-second
rates do not reflect reality.
2. If `workload_group_metrics_interval_ms` is changed at runtime, the
divisor updates immediately while the counter delta still spans the
*old* interval, producing a one-shot incorrect rate.
3. When `workload_group_metrics_interval_ms < 1000`, the integer division
rounds the divisor down to `0`, which would cause a divide-by-zero.
This PR replaces the fixed config-based interval with the **actual
monotonic time delta** between two consecutive `refresh_metrics()`
invocations:
```cpp
uint64_t current_time_ms = MonotonicMillis();
uint64_t interval_second = (current_time_ms - _last_refresh_time_ms) / 1000;
_last_refresh_time_ms = current_time_ms;
```
A new member `std::atomic<uint64_t> _last_refresh_time_ms{0}` is added to
record the timestamp of the previous refresh. This makes the per-second
CPU / local-scan / remote-scan metrics reflect the true elapsed wall-clock
interval, regardless of refresh-thread jitter or runtime config changes.
Known limitation: on the very first invocation after BE startup
`_last_refresh_time_ms` is `0`, so `interval_second` becomes
`MonotonicMillis() / 1000` (a large number) and the first sample of each
per-second metric will be reported as near-zero. The values converge to
correct readings from the second refresh onwards. A follow-up may
initialize `_last_refresh_time_ms` to `MonotonicMillis()` in the
constructor; kept out of this PR to keep the change minimal.
### Release note
None
### Check List (For Author)
- Test
- [x] No need to test or manual test. Explain why:
- [x] Previous test can cover this change. The change only replaces
the per-second-rate divisor with a more accurate value; the
metric semantics, plumbing and existing workload-group metrics
tests remain unchanged.
- Behavior changed:
- [x] No.
- Does this need documentation?
- [x] No.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]