liaoxin01 opened a new pull request, #64734:
URL: https://github.com/apache/doris/pull/64734

   ## Proposed changes
   
   Add a `file_cache_warm_up_job_num` bvar metric that tracks the number of 
warm up jobs currently held in a BE's memory, giving operators per-BE 
visibility into how many jobs each backend has received but not yet cleared.
   
   ### What changed
   
   In `be/src/cloud/cloud_warm_up_manager.cpp`:
   
   - New `bvar::Adder<int64_t> g_file_cache_warm_up_job_num("file_cache_warm_up_job_num")`.
   - **+1** when FE dispatches a new job to this BE:
     - `check_and_set_job_id` — regular (cluster/table) warm up `SET_JOB`, on 
`_cur_job_id` transition `0 -> job_id`.
     - `check_and_set_batch_id` — defensive, same `0 -> job_id` transition 
(e.g. if a `SET_BATCH` were to arrive first).
     - `set_event` — event-driven `SET_JOB`, when a new `job_id` is inserted 
into `_tablet_replica_cache`.
   - **-1** when the job is cleared:
     - `clear_job` — regular `CLEAR_JOB`, only when a live job actually existed.
     - `set_event` (clear) — event-driven `CLEAR_JOB`, only when 
`_tablet_replica_cache.erase()` actually removed an entry.
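
The regular (single-slot) path above can be sketched roughly as follows. This is a simplified model, not the code from `cloud_warm_up_manager.cpp`: `RegularJobSlot`, `set_job`, and `g_job_num_sketch` are hypothetical names, and a `std::atomic<int64_t>` stands in for `bvar::Adder<int64_t>` so the example builds without brpc.

```cpp
#include <atomic>
#include <cstdint>
#include <mutex>

// Stand-in for bvar::Adder<int64_t> (assumption: used only so the sketch
// compiles without brpc; the PR itself uses the bvar type).
std::atomic<int64_t> g_job_num_sketch{0};

// Hypothetical, simplified model of the single `_cur_job_id` slot.
class RegularJobSlot {
public:
    // Models SET_JOB / SET_BATCH: counts only on the 0 -> job_id transition.
    // Returns true if job_id owns the slot after the call.
    bool set_job(int64_t job_id) {
        std::lock_guard<std::mutex> lock(_mtx);
        if (_cur_job_id == 0 && job_id != 0) {
            _cur_job_id = job_id;
            g_job_num_sketch.fetch_add(1);
        }
        return _cur_job_id == job_id;
    }

    // Models CLEAR_JOB: decrements only if a live job actually existed.
    void clear_job() {
        std::lock_guard<std::mutex> lock(_mtx);
        if (_cur_job_id != 0) {
            _cur_job_id = 0;
            g_job_num_sketch.fetch_sub(1);
        }
    }

private:
    std::mutex _mtx;
    int64_t _cur_job_id = 0;  // 0 means "no job held"
};
```

In this model a retried `set_job` with the same id, or a second `clear_job`, leaves the counter unchanged because neither call causes a state transition.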
   
   ### Why it is dedup-safe
   
   All increments/decrements are gated on real state transitions rather than on 
RPC arrival:
   
   - `_cur_job_id` is a single slot whose only reset to `0` is in `clear_job` 
(which carries the matching `-1`), so a regular job contributes exactly one 
`+1` and one `-1` per lifecycle. Repeated `SET_JOB` (retry, FE failover replay) 
hits the `_cur_job_id != 0` guard and does not double count.
   - Event-driven counting is guarded by 
`!_tablet_replica_cache.contains(job_id)` (add) and `erase(job_id) > 0` 
(clear), so duplicate `SET_JOB`/`CLEAR_JOB` are no-ops.
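
The event-driven guard can be illustrated with a similar sketch. Again the names are hypothetical (`EventJobCache`, `ReplicaInfoSketch`, `g_event_job_num_sketch`), the value type of `_tablet_replica_cache` is a placeholder, and `std::atomic<int64_t>` stands in for the bvar adder. The sketch uses `count()` instead of `contains()` only so that it also builds with pre-C++20 compilers.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <unordered_map>

// Stand-in for the bvar adder (assumption, so the sketch builds without brpc).
std::atomic<int64_t> g_event_job_num_sketch{0};

// Placeholder for whatever the real cache stores per job (hypothetical).
struct ReplicaInfoSketch {};

// Hypothetical model of the event-driven job bookkeeping.
class EventJobCache {
public:
    // Models event-driven SET_JOB: +1 only when job_id is newly inserted.
    bool add_job(int64_t job_id) {
        std::lock_guard<std::mutex> lock(_mtx);
        if (_tablet_replica_cache.count(job_id) != 0) {
            return false;  // duplicate SET_JOB: no-op
        }
        _tablet_replica_cache.emplace(job_id, ReplicaInfoSketch{});
        g_event_job_num_sketch.fetch_add(1);
        return true;
    }

    // Models event-driven CLEAR_JOB: -1 only when erase() removed an entry.
    bool clear_job(int64_t job_id) {
        std::lock_guard<std::mutex> lock(_mtx);
        if (_tablet_replica_cache.erase(job_id) == 0) {
            return false;  // duplicate or unknown CLEAR_JOB: no-op
        }
        g_event_job_num_sketch.fetch_sub(1);
        return true;
    }

    std::size_t size() {
        std::lock_guard<std::mutex> lock(_mtx);
        return _tablet_replica_cache.size();
    }

private:
    std::mutex _mtx;
    std::unordered_map<int64_t, ReplicaInfoSketch> _tablet_replica_cache;
};
```

Because both branches key off the container's actual state change, in this model the counter always equals the number of entries in the map, regardless of how many times an RPC is replayed.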
   
   The metric is process-local: on BE restart it resets to `0` along with 
`_cur_job_id` / `_tablet_replica_cache`, so a count left stale by an 
abandoned-but-not-cleared job (e.g. BE down during `CLEAR_JOB`) is corrected 
on restart.
   
   The value is exposed via the BE `/vars` endpoint as 
`file_cache_warm_up_job_num`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


