liaoxin01 opened a new pull request, #63636:
URL: https://github.com/apache/doris/pull/63636
## Proposed changes
Issue Number: close #xxx
### Problem
`CloudReplica.getBackendId` was the dominant FE hotspot in cloud-mode
query/load planning. On a profiled query it accounted for **65.6% of total FE
CPU samples** (async-profiler). Every replica in the plan re-ran the full
cluster-id resolution pipeline even though the resolved cluster id is identical
for every tablet in the same request.
Call breakdown from the flame graph:
```
Tablet.getNormalReplicaBackendPathMap 68.6%
└─ CloudReplica.getBackendId 65.6%
└─ getCurrentClusterId 61.5%
├─ getCloudClusterIdByName 35.6%
│ ├─ getCloudClusterNames 24.7% (rw-lock +
ArrayList + stream.sorted)
│ └─ waitForAutoStart 9.7%
(StopWatch.start/stop even on NORMAL)
├─ getPhysicalCluster 14.5%
│ └─ getComputeGroupByName 14.5% (rw-lock around
ConcurrentHashMap)
└─ getCloudStatusByName 8.3%
└─ getCloudStatusByIdNoLock 14.0% (two stream passes,
String.valueOf per BE)
```
`ReadLock.unlock` consistently outweighed `ReadLock.lock` in the profile --
a classic cache-line bouncing signature, meaning the read-lock CAS was already
operating in the non-linear regime where adding concurrent threads
disproportionately hurts throughput. With high tablet counts the call rate
(`tablets × replicas × concurrent_queries`) easily reached 100k+ lock/unlock
per second per FE, putting the FE one nudge away from a metastable collapse.
### Changes
- **`CloudReplica.getCurrentClusterId`** becomes `public static` so callers
can resolve once per request. Adds `getBackendIdWithClusterId(String)` that
bypasses the per-replica pipeline.
- **`OlapTableSink.createLocation`** and
**`FrontendServiceImpl.{createPartition, replacePartition}`** resolve the
cluster id once before iterating tablets and pass it down via the new
`CloudTablet.getNormalReplicaBackendPathMap(String)` overload. For a 10k-tablet
query this collapses 10k full pipelines into 1.
- **`CloudSystemInfoService.getComputeGroupByName`**: drop the rw-lock
around `ConcurrentHashMap` reads, merge `containsKey+get` into a single `get`,
guard the debug log behind `isDebugEnabled` (the map `toString` is expensive).
- **`CloudSystemInfoService.containsCloudCluster`** (new): cheap existence
check that replaces `getCloudClusterNames().contains(name)`. Avoids an
ArrayList copy + stream filter + natural sort + collect under a read lock for a
single existence query.
- **`CloudSystemInfoService.getCloudStatusByIdNoLock`**: single-pass loop
with a precomputed `NORMAL` constant in place of two stream pipelines that
re-evaluated `String.valueOf(NORMAL)` per backend.
- **`CloudSystemInfoService.waitForAutoStart`**: fast-path return when the
cluster is already `NORMAL`, skipping the `withTemporaryNereidsTimeout` wrap
and the `StopWatch.start/stop` inside `waitForClusterToResume` (the while loop
never executed in that state but the wrapping still ran per call -- ~3% of
total FE CPU in the profile).
### Behavior
Semantics preserved on all paths:
- `waitForAutoStart` NORMAL fast-path: the original code already had
`existAliveBe = true` as initializer, so `waitForClusterToResume` was a no-op
for NORMAL clusters.
- `getComputeGroupByName` without rw-lock: the brief window between a
rename's two map updates can now return `null`; same as the existing read-only
paths everywhere else in this class that already access these maps without
locks.
- `containsCloudCluster` matches `getCloudClusterNames().contains(name)` for
non-empty `name` (empty-name filtering in `getCloudClusterNames` was for the
returned list, not for `.contains` semantics).
### Further comments
The cluster id resolution is hoisted only on the no-BE-endpoint paths (the
hot ones in the profile). The endpoint-resolved path in `FrontendServiceImpl`
already has its own resolution logic and is untouched.
## Checklist(Required)
1. Does it affect the original behavior:
- [ ] Yes
- [x] No
- [ ] I don't know
2. Has unit tests been added:
- [ ] Yes
- [x] No Need
- [ ] Same as the original logic
3. Has document been added or modified:
- [ ] Yes
- [x] No Need
4. Does it need to update dependencies:
- [ ] Yes
- [x] No
5. Are there any changes that cannot be rolled back:
- [ ] Yes (If Yes, please explain WHY)
- [x] No
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]