bosswnx opened a new pull request, #63244:
URL: https://github.com/apache/doris/pull/63244

   ### What problem does this PR solve?
   
   Issue Number: N/A
   
   Related PR: N/A
   
   Problem Summary:
   
   `WorkloadGroupMetrics::refresh_metrics()` computes per-second CPU time and
   local/remote scan bytes by dividing the delta of each counter by an interval
   derived from `config::workload_group_metrics_interval_ms`:
   
   ```cpp
   int interval_second = config::workload_group_metrics_interval_ms / 1000;
   ...
   _per_sec_cpu_time_nanos = (_current_cpu_time_nanos - _last_cpu_time_nanos) / 
interval_second;
   ```
   
   This is inaccurate for several reasons:
   
   1. The refresh thread is not guaranteed to fire exactly every
      `workload_group_metrics_interval_ms`. Under BE load or scheduling delays
      the actual gap between two refreshes can drift significantly, but the
      divisor is still the configured interval, so the reported per-second
      rates do not reflect reality.
   2. If `workload_group_metrics_interval_ms` is changed at runtime, the
      divisor updates immediately while the counter delta still spans the
      *old* interval, producing a one-shot incorrect rate.
   3. When `workload_group_metrics_interval_ms < 1000`, the integer division
      rounds the divisor down to `0`, which would cause a divide-by-zero.
   
   This PR replaces the fixed config-based interval with the **actual
   monotonic time delta** between two consecutive `refresh_metrics()`
   invocations:
   
   ```cpp
   uint64_t current_time_ms = MonotonicMillis();
   uint64_t interval_second = (current_time_ms - _last_refresh_time_ms) / 1000;
   _last_refresh_time_ms = current_time_ms;
   ```
   
   A new member `std::atomic<uint64_t> _last_refresh_time_ms{0}` is added to
   record the timestamp of the previous refresh. This makes the per-second
   CPU / local-scan / remote-scan metrics reflect the true elapsed wall-clock
   interval, regardless of refresh-thread jitter or runtime config changes.
   
   Known limitation: on the very first invocation after BE startup
   `_last_refresh_time_ms` is `0`, so `interval_second` becomes
   `MonotonicMillis() / 1000` (a large number) and the first sample of each
   per-second metric will be reported as near-zero. The values converge to
   correct readings from the second refresh onwards. A follow-up may
   initialize `_last_refresh_time_ms` to `MonotonicMillis()` in the
   constructor; kept out of this PR to keep the change minimal.
   
   ### Release note
   
   None
   
   ### Check List (For Author)
   
   - Test
       - [x] No need to test or manual test. Explain why:
           - [x] Previous test can cover this change. The change only replaces
                 the per-second-rate divisor with a more accurate value; the
                 metric semantics, plumbing and existing workload-group metrics
                 tests remain unchanged.
   
   - Behavior changed:
       - [x] No.
   
   - Does this need documentation?
       - [x] No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to