Re: [PR] branch-4.1: [opt](cloud) cache cluster id per query and drop redundant locks on getBackendId hot path #63636 [doris]

via GitHub Mon, 08 Jun 2026 23:48:32 -0700


Copilot commented on code in PR #64276:
URL: https://github.com/apache/doris/pull/64276#discussion_r3378477825



##########
fe/fe-core/src/main/java/org/apache/doris/cloud/system/CloudSystemInfoService.java:
##########
@@ -120,18 +121,100 @@ public void 
renameVirtualClusterInfoFromMapsNoLock(String clusterId, String oldC
     }
 
     public ComputeGroup getComputeGroupByName(String computeGroupName) {
-        LOG.debug("get id {} computeGroupIdToComputeGroup : {} ", 
computeGroupName, computeGroupIdToComputeGroup);
+        // rlock guards the compound name->id->group lookup: writers 
(add/remove/rename)
+        // update both maps under wlock, and the read must observe a 
consistent snapshot
+        // so callers like getPhysicalCluster don't transiently see a virtual 
group name
+        // with a null group and fall back to treating it as a physical 
cluster.
+        if (LOG.isDebugEnabled()) {
+            LOG.debug("get id {} computeGroupIdToComputeGroup : {} ", 
computeGroupName, computeGroupIdToComputeGroup);
+        }
         try {
             rlock.lock();
-            if (!clusterNameToId.containsKey(computeGroupName)) {
-                return null;
-            }
-            return 
computeGroupIdToComputeGroup.get(clusterNameToId.get(computeGroupName));
+            String id = clusterNameToId.get(computeGroupName);
+            return id == null ? null : computeGroupIdToComputeGroup.get(id);
         } finally {
             rlock.unlock();
         }
     }
 
+    public boolean containsCloudCluster(String clusterName) {
+        return !Strings.isNullOrEmpty(clusterName) && 
clusterNameToId.containsKey(clusterName);
+    }
+
+    // Resolve the cluster id for the current ConnectContext: physical-cluster 
lookup,
+    // priv check, status check (reject MANUAL_SHUTDOWN), wait-for-autoStart, 
existence
+    // check, finally name->id mapping. The result is identical for every 
tablet/replica
+    // within a single request, so hot paths should resolve once and reuse the 
cached value.
+    public String getCurrentClusterId() throws ComputeGroupException {
+        ConnectContext context = ConnectContext.get();
+        if (context == null) {
+            if (LOG.isDebugEnabled()) {
+                LOG.debug("connect context is null in getCurrentClusterId");
+            }
+            throw new ComputeGroupException("connect context not set cluster ",
+                    
ComputeGroupException.FailedTypeEnum.CONNECT_CONTEXT_NOT_SET);
+        }
+
+        String cluster = getPhysicalCluster(context.getCloudCluster());
+        if (LOG.isDebugEnabled()) {
+            LOG.debug("get compute group by context {}", cluster);
+        }
+
+        UserIdentity currentUid = context.getCurrentUserIdentity();
+        if (currentUid == null || 
Strings.isNullOrEmpty(currentUid.getQualifiedUser())) {
+            LOG.info("connect context user is null.");
+            throw new ComputeGroupException("connect context's user is null",
+                    
ComputeGroupException.FailedTypeEnum.CURRENT_USER_NO_AUTH_TO_USE_DEFAULT_COMPUTE_GROUP);
+        }
+        try {
+            ((CloudEnv) Env.getCurrentEnv()).checkCloudClusterPriv(cluster);
+        } catch (Exception e) {
+            LOG.warn("check compute group {} for {} auth failed.", cluster, 
currentUid);
+            throw new ComputeGroupException(
+                    String.format("context compute group %s check auth failed, 
user is %s", cluster, currentUid),
+                    
ComputeGroupException.FailedTypeEnum.CURRENT_USER_NO_AUTH_TO_USE_DEFAULT_COMPUTE_GROUP);
+        }
+
+        String clusterStatus = getCloudStatusByName(cluster);
+        if (!Strings.isNullOrEmpty(clusterStatus)
+                && Cloud.ClusterStatus.valueOf(clusterStatus) == 
Cloud.ClusterStatus.MANUAL_SHUTDOWN) {
+            LOG.warn("auto start compute group {} in manual shutdown status", 
cluster);
+            throw new ComputeGroupException(
+                    String.format("The current compute group %s has been 
manually shutdown", cluster),
+                    
ComputeGroupException.FailedTypeEnum.CURRENT_COMPUTE_GROUP_BEEN_MANUAL_SHUTDOWN);
+        }
+
+        return resolveClusterIdByName(cluster);
+    }
+
+    // Resolve a known cluster name to its id, handling auto-start (cluster 
may resume
+    // under a different name) and validating the cluster is registered.
+    public String resolveClusterIdByName(String cluster) throws 
ComputeGroupException {
+        String wakeUPCluster = "";
+        try {
+            wakeUPCluster = waitForAutoStart(cluster);
+        } catch (DdlException e) {
+            LOG.warn("cant resume compute group {}, exception", cluster, e);
+        }
+        if (!Strings.isNullOrEmpty(wakeUPCluster) && 
!cluster.equals(wakeUPCluster)) {
+            cluster = wakeUPCluster;
+            LOG.warn("get backend input compute group {} useless, so auto 
start choose a new one compute group {}",
+                    cluster, wakeUPCluster);
+        }

Review Comment:
   `resolveClusterIdByName` overwrites `cluster` with `wakeUPCluster` before 
logging, so the first `{}` in the warning always prints the *new* cluster name 
(often identical to the second placeholder) and loses the original input value. 
This makes the log misleading when auto-start selects a different compute group.



##########
fe/fe-core/src/main/java/org/apache/doris/cloud/catalog/CloudReplica.java:
##########
@@ -216,89 +231,45 @@ public long getClusterPrimaryBackendId(String clusterId) {
         return primaryClusterToBackend.getOrDefault(clusterId, -1L);
     }
 
-    private String getCurrentClusterId() throws ComputeGroupException {
-        // Not in a connect session
-        String cluster = null;
-        ConnectContext context = ConnectContext.get();
-        if (context != null) {
-            // TODO(wb) rethinking whether should update err status.
-            cluster = ((CloudSystemInfoService) Env.getCurrentSystemInfo())
-                    .getPhysicalCluster(context.getCloudCluster());
-
-            if (LOG.isDebugEnabled()) {
-                LOG.debug("get compute group by context {}", cluster);
-            }
-
-            UserIdentity currentUid = context.getCurrentUserIdentity();
-            if (currentUid != null && 
!StringUtils.isEmpty(currentUid.getQualifiedUser())) {
-                try {
-                    ((CloudEnv) 
Env.getCurrentEnv()).checkCloudClusterPriv(cluster);
-                } catch (Exception e) {
-                    LOG.warn("check compute group {} for {} auth failed.", 
cluster,
-                            context.getCurrentUserIdentity().toString());
-                    throw new ComputeGroupException(
-                            String.format("context compute group %s check auth 
failed, user is %s",
-                                    cluster, 
context.getCurrentUserIdentity().toString()),
-                            
ComputeGroupException.FailedTypeEnum.CURRENT_USER_NO_AUTH_TO_USE_DEFAULT_COMPUTE_GROUP);
+    // For proc display only. In cloud mode a replica is hashed to a different 
BE in each
+    // compute group, so expose a clusterId -> backendId mapping; the proc 
display builds
+    // a separate bucket sequence per compute group from it so each group's 
sequence is
+    // self-consistent. Do not collapse this into a single BE (e.g. the first 
one):
+    // backends differ across compute groups and would not match.
+    //
+    // ATTN: colocated replicas do NOT use primaryClusterToBackend (see 
getBackendIdImpl /
+    // getClusterPrimaryBackendId, which short-circuit to getColocatedBeId), 
so that cache
+    // is empty for them. Resolve their placement per compute group on the fly 
instead.
+    // This reads CloudSystemInfoService / the colocate index, so callers must 
invoke it
+    // OUTSIDE any table lock to avoid nested lock acquisition. It does not 
auto-start any
+    // compute group: getColocatedBeId only reads the already-known backends 
of a clusterId.
+    // computeGroupBackendCache maps compute group id -> 
getBackendsByClusterId() result and
+    // is shared across all replicas resolved in a single proc call, so each 
compute group's
+    // backend list is fetched only once instead of once per replica.
+    @Override
+    public Map<String, Long> getClusterToBackendForProcDisplay(Map<String, 
List<Backend>> computeGroupBackendCache) {
+        if (!isColocated()) {
+            return new HashMap<>(primaryClusterToBackend);
+        }
+        Map<String, Long> result = new HashMap<>();
+        CloudSystemInfoService infoService = (CloudSystemInfoService) 
Env.getCurrentSystemInfo();
+        for (String clusterId : infoService.getCloudClusterIds()) {
+            try {
+                List<Backend> clusterBackends =
+                        computeGroupBackendCache.computeIfAbsent(clusterId, 
infoService::getBackendsByClusterId);
+                long backendId = getColocatedBeId(clusterId, clusterBackends);

Review Comment:
   This calls `getColocatedBeId(clusterId, clusterBackends)`, but there is no 
such overload in `CloudReplica` (only `getColocatedBeId(String)` exists). 
As-is, this will not compile.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] branch-4.1: [opt](cloud) cache cluster id per query and drop redundant locks on getBackendId hot path #63636 [doris]

Reply via email to