Re: [PR] [fix](cloud) Fix auto-start functionality when encountering TVF and external queries [doris]

via GitHub Tue, 27 Jan 2026 01:52:23 -0800


deardeng commented on code in PR #59963:
URL: https://github.com/apache/doris/pull/59963#discussion_r2731162917



##########
fe/fe-core/src/main/java/org/apache/doris/cloud/system/CloudSystemInfoService.java:
##########
@@ -1416,91 +1420,159 @@ public String getClusterNameAutoStart(final String 
clusterName) {
     }
 
     public String waitForAutoStart(String clusterName) throws DdlException {
-        if (Config.isNotCloudMode()) {
-            return null;
-        }
-        if (!Config.enable_auto_start_for_cloud_cluster) {
+        if (Config.isNotCloudMode() || 
!Config.enable_auto_start_for_cloud_cluster) {
             return null;
         }
-        clusterName = getClusterNameAutoStart(clusterName);
-        if (Strings.isNullOrEmpty(clusterName)) {
-            LOG.warn("auto start in cloud mode, but clusterName empty {}", 
clusterName);
+        String resolvedClusterName = getClusterNameAutoStart(clusterName);
+        if (Strings.isNullOrEmpty(resolvedClusterName)) {
+            LOG.warn("auto start in cloud mode, but clusterName empty {}", 
resolvedClusterName);
             return null;
         }
-        String clusterStatus = getCloudStatusByName(clusterName);
-        if (Strings.isNullOrEmpty(clusterStatus)) {
+        String clusterStatusStr = getCloudStatusByName(resolvedClusterName);
+        Cloud.ClusterStatus clusterStatus = 
parseClusterStatusOrNull(clusterStatusStr, resolvedClusterName);
+        if (clusterStatus == null) {
+            LOG.warn("auto start in cloud mode, but clusterStatus empty {}", 
clusterStatusStr);
             // for cluster rename or cluster dropped
-            LOG.warn("cant find clusterStatus in fe, clusterName {}", 
clusterName);
             return null;
         }
 
-        if (Cloud.ClusterStatus.valueOf(clusterStatus) == 
Cloud.ClusterStatus.MANUAL_SHUTDOWN) {
-            LOG.warn("auto start cluster {} in manual shutdown status", 
clusterName);
-            throw new DdlException("cluster " + clusterName + " is in manual 
shutdown");
+        if (clusterStatus == Cloud.ClusterStatus.MANUAL_SHUTDOWN) {
+            LOG.warn("auto start cluster {} in manual shutdown status", 
resolvedClusterName);
+            throw new DdlException("cluster " + resolvedClusterName + " is in 
manual shutdown");
         }
 
-        // nofity ms -> wait for clusterStatus to normal
-        LOG.debug("auto start wait cluster {} status {}", clusterName, 
clusterStatus);
-        if (Cloud.ClusterStatus.valueOf(clusterStatus) != 
Cloud.ClusterStatus.NORMAL) {
+        // notify ms -> wait for clusterStatus to normal
+        LOG.debug("auto start wait cluster {} status {}", resolvedClusterName, 
clusterStatus);
+        if (clusterStatus != Cloud.ClusterStatus.NORMAL) {
             // ATTN: prevent `Automatic Analyzer` daemon threads from pulling 
up clusters
             // FeConstants.INTERNAL_DB_NAME ? see 
StatisticsUtil.buildConnectContext
-            List<String> ignoreDbNameList = 
Arrays.asList(Config.auto_start_ignore_resume_db_names);
-            if (ConnectContext.get() != null && 
ignoreDbNameList.contains(ConnectContext.get().getDatabase())) {
+            ConnectContext ctx = ConnectContext.get();
+            if (shouldIgnoreAutoStart(ctx)) {
                 LOG.warn("auto start daemon thread db {}, not resume cluster 
{}-{}",
-                        ConnectContext.get().getDatabase(), clusterName, 
clusterStatus);
+                        ctx.getDatabase(), resolvedClusterName, clusterStatus);
                 return null;
             }
-            Cloud.AlterClusterRequest.Builder builder = 
Cloud.AlterClusterRequest.newBuilder();
-            builder.setCloudUniqueId(Config.cloud_unique_id);
-            builder.setRequestIp(FrontendOptions.getLocalHostAddressCached());
-            
builder.setOp(Cloud.AlterClusterRequest.Operation.SET_CLUSTER_STATUS);
+            notifyMetaServiceToResumeCluster(resolvedClusterName);
+        }
+        // wait 5 mins
+        int retryTimes = Config.auto_start_wait_to_resume_times < 0 ? 300 : 
Config.auto_start_wait_to_resume_times;
+        String finalClusterName = resolvedClusterName;
+        String initialClusterStatus = clusterStatusStr;
+        withTemporaryNereidsTimeout(() -> {
+            waitForClusterToResume(finalClusterName, retryTimes, 
initialClusterStatus);
+        });
+        return resolvedClusterName;
+    }
 
-            ClusterPB.Builder clusterBuilder = ClusterPB.newBuilder();
-            clusterBuilder.setClusterId(getCloudClusterIdByName(clusterName));
-            clusterBuilder.setClusterStatus(Cloud.ClusterStatus.TO_RESUME);
-            builder.setCluster(clusterBuilder);
+    private Cloud.ClusterStatus parseClusterStatusOrNull(String 
clusterStatusStr, String clusterName) {
+        if (Strings.isNullOrEmpty(clusterStatusStr)) {
+            // for cluster rename or cluster dropped
+            LOG.warn("cant find clusterStatus in fe, clusterName {}", 
clusterName);
+            return null;
+        }
+        try {
+            return Cloud.ClusterStatus.valueOf(clusterStatusStr);
+        } catch (Throwable t) {
+            LOG.warn("invalid clusterStatus {} for clusterName {}", 
clusterStatusStr, clusterName, t);
+            return null;
+        }
+    }
 
-            Cloud.AlterClusterResponse response;
-            try {
-                Cloud.AlterClusterRequest request = builder.build();
-                response = 
MetaServiceProxy.getInstance().alterCluster(request);
-                LOG.info("alter cluster, request: {}, response: {}", request, 
response);
-                if (response.getStatus().getCode() != 
Cloud.MetaServiceCode.OK) {
-                    LOG.warn("notify to resume cluster not ok, cluster {}, 
response: {}", clusterName, response);
-                }
-                LOG.info("notify to resume cluster {}, response: {} ", 
clusterName, response);
-            } catch (RpcException e) {
-                LOG.warn("failed to notify to resume cluster {}", clusterName, 
e);
-                throw new DdlException("notify to resume cluster not ok");
+    private boolean shouldIgnoreAutoStart(ConnectContext ctx) {
+        if (ctx == null) {
+            return false;
+        }
+        String dbName = ctx.getDatabase();
+        if (Strings.isNullOrEmpty(dbName) || 
Config.auto_start_ignore_resume_db_names == null) {
+            return false;
+        }
+        for (String ignore : Config.auto_start_ignore_resume_db_names) {
+            if (dbName.equals(ignore)) {
+                return true;
             }
         }
-        // wait 5 mins
-        int retryTimes = Config.auto_start_wait_to_resume_times < 0 ? 300 : 
Config.auto_start_wait_to_resume_times;
+        return false;
+    }
+
+    private void notifyMetaServiceToResumeCluster(String clusterName) throws 
DdlException {
+        Cloud.AlterClusterRequest.Builder builder = 
Cloud.AlterClusterRequest.newBuilder();
+        builder.setCloudUniqueId(Config.cloud_unique_id);
+        builder.setRequestIp(FrontendOptions.getLocalHostAddressCached());
+        builder.setOp(Cloud.AlterClusterRequest.Operation.SET_CLUSTER_STATUS);
+
+        ClusterPB.Builder clusterBuilder = ClusterPB.newBuilder();
+        clusterBuilder.setClusterId(getCloudClusterIdByName(clusterName));
+        clusterBuilder.setClusterStatus(Cloud.ClusterStatus.TO_RESUME);
+        builder.setCluster(clusterBuilder);
+
+        try {
+            Cloud.AlterClusterRequest request = builder.build();
+            Cloud.AlterClusterResponse response = 
MetaServiceProxy.getInstance().alterCluster(request);
+            LOG.info("alter cluster, request: {}, response: {}", request, 
response);
+            if (response.getStatus().getCode() != Cloud.MetaServiceCode.OK) {
+                LOG.warn("notify to resume cluster not ok, cluster {}, 
response: {}", clusterName, response);
+            }
+            LOG.info("notify to resume cluster {}, response: {} ", 
clusterName, response);
+        } catch (RpcException e) {
+            LOG.warn("failed to notify to resume cluster {}", clusterName, e);
+            throw new DdlException("notify to resume cluster not ok");
+        }
+    }
+
+    /**
+     * Wait for cluster to resume to NORMAL status with alive backends.
+     * @param clusterName the name of the cluster
+     * @param retryTimes maximum number of retry attempts
+     * @param initialClusterStatus the initial cluster status
+     * @throws DdlException if the cluster fails to resume within the retry 
limit
+     */
+    private void waitForClusterToResume(String clusterName, int retryTimes, 
String initialClusterStatus)
+            throws DdlException {
         int retryTime = 0;
         StopWatch stopWatch = new StopWatch();
         stopWatch.start();
         boolean hasAutoStart = false;
         boolean existAliveBe = true;
-        while 
((!String.valueOf(Cloud.ClusterStatus.NORMAL).equals(clusterStatus) || 
!existAliveBe)
+        String clusterStatusStr = initialClusterStatus;
+        Cloud.ClusterStatus clusterStatus = 
parseClusterStatusOrNull(clusterStatusStr, clusterName);
+        Cloud.ClusterStatus lastLoggedStatus = clusterStatus;
+        boolean lastLoggedExistAliveBe = existAliveBe;
+
+        while ((clusterStatus != Cloud.ClusterStatus.NORMAL || !existAliveBe)
             && retryTime < retryTimes) {
             hasAutoStart = true;
             ++retryTime;
             // sleep random millis [0.5, 1] s
-            int randomSeconds =  500 + (int) (Math.random() * (1000 - 500));
-            LOG.info("resume cluster {} retry times {}, wait randomMillis: {}, 
current status: {}",
-                    clusterName, retryTime, randomSeconds, clusterStatus);
+            int sleepMs = ThreadLocalRandom.current().nextInt(500, 1001);

Review Comment:
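   **Good improvement**: `ThreadLocalRandom.current().nextInt(500, 1001)` states the intended [500, 1000] ms range directly (the old `Math.random()` expression could never return 1000) and avoids contention on the shared generator behind `Math.random()`. Note that `Math.random()` is itself thread-safe, so the gain is clarity and reduced contention rather than correctness.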
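
   A minimal sketch of the jittered sleep, assuming the two windows shown in the diff ([500, 1000] ms for the first half of the retries, [1000, 1500] ms afterwards). The names `ResumeBackoff` and `nextSleepMs` are hypothetical and not part of the PR.

   ```java
   import java.util.concurrent.ThreadLocalRandom;

   // Hypothetical helper, not part of the PR: picks the sleep for one retry.
   final class ResumeBackoff {
       private ResumeBackoff() {
       }

       /** Returns [500, 1000] ms for the first half of the retries, [1000, 1500] ms after that. */
       static int nextSleepMs(int retryTime, int retryTimes) {
           // nextInt(origin, bound) excludes the bound, hence 1001 and 1501.
           if (retryTime > retryTimes / 2) {
               return ThreadLocalRandom.current().nextInt(1000, 1501);
           }
           return ThreadLocalRandom.current().nextInt(500, 1001);
       }
   }
   ```

   Keeping both windows in one method would also make the upper bounds easy to unit test.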



##########
fe/fe-core/src/main/java/org/apache/doris/nereids/jobs/scheduler/SimpleJobScheduler.java:
##########
@@ -35,9 +37,21 @@ public void executeJobPool(ScheduleContext scheduleContext) {
         SessionVariable sessionVariable = context.getConnectContext().getSessionVariable();
         while (!pool.isEmpty()) {
             long elapsedS = context.getStatementContext().getStopwatch().elapsed(TimeUnit.MILLISECONDS) / 1000;
-            if (sessionVariable.enableNereidsTimeout && elapsedS > sessionVariable.nereidsTimeoutSecond) {
-                throw QueryPlanningErrors.planTimeoutError(elapsedS, sessionVariable.nereidsTimeoutSecond,
-                        context.getConnectContext().getExecutor().getSummaryProfile());
+            if (sessionVariable.enableNereidsTimeout) {
+                SummaryProfile summaryProfile = context.getConnectContext().getExecutor().getSummaryProfile();
+                if (summaryProfile.isWarmup()) {
+                    // Fix errCode = 2, detailMessage = Nereids cost too much time (36s > 30s).
+                    // For warmup queries, use a longer timeout (300 seconds)
+                    if (elapsedS > Config.auto_start_wait_to_resume_times) {

Review Comment:
   **Same issue as in `RewriteJob`**: `Config.auto_start_wait_to_resume_times` is compared against `elapsedS` as if it were a number of seconds, but it is a retry count. Mixing the two units is inconsistent and error-prone.
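
   One way to keep the units in a single place, sketched under assumptions: the helper `PlanTimeouts` and the parameter `warmupTimeoutSecond` are hypothetical (the PR has no dedicated seconds value), and taking the larger of the two limits during warmup is a design choice of this sketch, not something the PR does.

   ```java
   // Hypothetical helper, not part of the PR: both arguments are in seconds.
   final class PlanTimeouts {
       private PlanTimeouts() {
       }

       /** Limit in seconds: the normal limit, or the larger of the two while warming up. */
       static long limitSeconds(boolean warmup, long nereidsTimeoutSecond, long warmupTimeoutSecond) {
           if (!warmup) {
               return nereidsTimeoutSecond;
           }
           // Never let a small or negative warmup value shorten the normal limit.
           return Math.max(nereidsTimeoutSecond, warmupTimeoutSecond);
       }

       static boolean timedOut(long elapsedS, boolean warmup,
                               long nereidsTimeoutSecond, long warmupTimeoutSecond) {
           return elapsedS > limitSeconds(warmup, nereidsTimeoutSecond, warmupTimeoutSecond);
       }
   }
   ```

   With the 36 s case quoted in the diff comment, `timedOut(36, false, 30, 300)` is true and `timedOut(36, true, 30, 300)` is false. `SimpleJobScheduler` and `RewriteJob` could share such a helper instead of repeating the check.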



##########
fe/fe-core/src/main/java/org/apache/doris/cloud/system/CloudSystemInfoService.java:
##########
@@ -1513,9 +1585,47 @@ public String waitForAutoStart(String clusterName) throws DdlException {
         if (hasAutoStart) {
             LOG.info("auto start cluster {}, start cost {} ms", clusterName, stopWatch.getTime());
         }
-        return clusterName;
     }
 
+    /**
+     * Temporarily set nereids timeout and restore it after execution.
+     * @param runnable the code to execute with the temporary timeout
+     * @throws DdlException if the runnable throws DdlException
+     */
+    private void withTemporaryNereidsTimeout(RunnableWithException runnable) throws DdlException {
+        ConnectContext ctx = ConnectContext.get();
+        if (ctx == null) {
+            runnable.run();
+            return;
+        }
+
+        SessionVariable sessionVariable = ctx.getSessionVariable();
+        if (!sessionVariable.enableNereidsTimeout) {
+            runnable.run();
+            return;
+        }
+
+        StmtExecutor executor = ctx.getExecutor();
+        if (executor == null) {
+            runnable.run();
+            return;
+        }
+
+        SummaryProfile profile = ctx.getExecutor().getSummaryProfile();
+        if (profile == null) {
+            runnable.run();
+            return;
+        }
+        profile.setWarmup(true);

Review Comment:
   **Issue**: The `warmup` flag is set to `true` but never reset to `false` after the operation completes. Subsequent queries in the same session could then be treated as warmup queries and run under the longer timeout.
   
   **Suggestion**: Use a try-finally block to ensure the flag is reset:
   ```java
   boolean originalWarmup = profile.isWarmup();
   try {
       profile.setWarmup(true);
       runnable.run();
   } finally {
       profile.setWarmup(originalWarmup);
   }
   ```
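
   The same pattern as a self-contained sketch that can be run outside Doris. `Profile` is a stand-in for `SummaryProfile` that models only the flag, and `WarmupScope` is a hypothetical name; for brevity it takes a plain `Runnable` rather than the `RunnableWithException` used in the PR.

   ```java
   // Hypothetical stand-ins, not Doris classes.
   final class WarmupScope {
       private WarmupScope() {
       }

       /** Models only the warmup flag of a profile. */
       static final class Profile {
           private boolean warmup;

           boolean isWarmup() {
               return warmup;
           }

           void setWarmup(boolean warmup) {
               this.warmup = warmup;
           }
       }

       /** Runs the action with warmup set, then restores the previous value even on failure. */
       static void withWarmup(Profile profile, Runnable action) {
           boolean original = profile.isWarmup();
           profile.setWarmup(true);
           try {
               action.run();
           } finally {
               profile.setWarmup(original);
           }
       }
   }
   ```

   Restoring the saved value rather than hard-coding `false` keeps nested calls correct.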



##########
fe/fe-core/src/main/java/org/apache/doris/nereids/jobs/rewrite/RewriteJob.java:
##########
@@ -36,9 +38,20 @@ default void checkTimeout(JobContext jobContext) {
         CascadesContext context = jobContext.getCascadesContext();
         SessionVariable sessionVariable = context.getConnectContext().getSessionVariable();
         long elapsedS = context.getStatementContext().getStopwatch().elapsed(TimeUnit.MILLISECONDS) / 1000;
-        if (sessionVariable.enableNereidsTimeout && elapsedS > sessionVariable.nereidsTimeoutSecond) {
-            throw QueryPlanningErrors.planTimeoutError(elapsedS, sessionVariable.nereidsTimeoutSecond,
-                    context.getConnectContext().getExecutor().getSummaryProfile());
+        if (sessionVariable.enableNereidsTimeout) {
+            SummaryProfile summaryProfile = context.getConnectContext().getExecutor().getSummaryProfile();
+            if (summaryProfile.isWarmup()) {
+                // For warmup queries, use a longer timeout (300 seconds)
+                if (elapsedS > Config.auto_start_wait_to_resume_times) {

Review Comment:
   **Critical Issue**: `Config.auto_start_wait_to_resume_times` represents the number of retry attempts, not seconds. Using it directly as a timeout value in seconds is semantically incorrect.
   
   The default value of 300 happens to work only because each retry sleeps between 0.5 and 1.5 s, so 300 retries take roughly 300 seconds on average. The relationship is a coincidence of the current sleep windows, and it invites confusion and bugs if the config value or the sleep intervals change.
   
   **Suggestion**: Consider using a separate config like `auto_start_wait_timeout_seconds` for the timeout, or calculate the timeout from the retry count and the sleep intervals (e.g., `retryTimes * averageSleepMs / 1000`).
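
   To make the arithmetic concrete, here is a small sketch that derives the total sleep time from the retry count, assuming the windows in `waitForClusterToResume` ([500, 1000] ms while `retryTime <= retryTimes / 2`, [1000, 1500] ms afterwards). `ResumeWaitEstimate` is a hypothetical name, and the figures ignore time spent in RPCs and any early exit from the loop.

   ```java
   // Hypothetical helper, not part of the PR.
   final class ResumeWaitEstimate {
       private ResumeWaitEstimate() {
       }

       /** Expected total sleep in ms, using the midpoint of each window (750 ms and 1250 ms). */
       static long expectedMs(int retryTimes) {
           int firstHalf = retryTimes / 2;          // retries with retryTime <= retryTimes / 2
           int secondHalf = retryTimes - firstHalf; // retries with retryTime > retryTimes / 2
           return firstHalf * 750L + secondHalf * 1250L;
       }

       /** Worst-case total sleep in ms, using the upper bound of each window. */
       static long worstCaseMs(int retryTimes) {
           int firstHalf = retryTimes / 2;
           int secondHalf = retryTimes - firstHalf;
           return firstHalf * 1000L + secondHalf * 1500L;
       }
   }
   ```

   For 300 retries this gives an expected 300,000 ms and a worst case of 375,000 ms, so the wait loop can outlast a 300 s planning limit. It may also be worth checking the negative case: `waitForAutoStart` falls back to 300 when the config is negative, while the comparison in this hunk uses the raw value, which would make `elapsedS > Config.auto_start_wait_to_resume_times` true immediately.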



##########
fe/fe-core/src/main/java/org/apache/doris/cloud/system/CloudSystemInfoService.java:
##########


Review Comment:
   **Good improvement**: Refactoring `waitForAutoStart` improves readability and maintainability. Extracting the helper methods (`parseClusterStatusOrNull`, `shouldIgnoreAutoStart`, `notifyMetaServiceToResumeCluster`, `waitForClusterToResume`) makes the code more modular.



##########
fe/fe-core/src/main/java/org/apache/doris/cloud/system/CloudSystemInfoService.java:
##########
@@ -1416,91 +1420,159 @@ public String getClusterNameAutoStart(final String 
clusterName) {
     }
 
     public String waitForAutoStart(String clusterName) throws DdlException {
-        if (Config.isNotCloudMode()) {
-            return null;
-        }
-        if (!Config.enable_auto_start_for_cloud_cluster) {
+        if (Config.isNotCloudMode() || 
!Config.enable_auto_start_for_cloud_cluster) {
             return null;
         }
-        clusterName = getClusterNameAutoStart(clusterName);
-        if (Strings.isNullOrEmpty(clusterName)) {
-            LOG.warn("auto start in cloud mode, but clusterName empty {}", 
clusterName);
+        String resolvedClusterName = getClusterNameAutoStart(clusterName);
+        if (Strings.isNullOrEmpty(resolvedClusterName)) {
+            LOG.warn("auto start in cloud mode, but clusterName empty {}", 
resolvedClusterName);
             return null;
         }
-        String clusterStatus = getCloudStatusByName(clusterName);
-        if (Strings.isNullOrEmpty(clusterStatus)) {
+        String clusterStatusStr = getCloudStatusByName(resolvedClusterName);
+        Cloud.ClusterStatus clusterStatus = 
parseClusterStatusOrNull(clusterStatusStr, resolvedClusterName);
+        if (clusterStatus == null) {
+            LOG.warn("auto start in cloud mode, but clusterStatus empty {}", 
clusterStatusStr);
             // for cluster rename or cluster dropped
-            LOG.warn("cant find clusterStatus in fe, clusterName {}", 
clusterName);
             return null;
         }
 
-        if (Cloud.ClusterStatus.valueOf(clusterStatus) == 
Cloud.ClusterStatus.MANUAL_SHUTDOWN) {
-            LOG.warn("auto start cluster {} in manual shutdown status", 
clusterName);
-            throw new DdlException("cluster " + clusterName + " is in manual 
shutdown");
+        if (clusterStatus == Cloud.ClusterStatus.MANUAL_SHUTDOWN) {
+            LOG.warn("auto start cluster {} in manual shutdown status", 
resolvedClusterName);
+            throw new DdlException("cluster " + resolvedClusterName + " is in 
manual shutdown");
         }
 
-        // nofity ms -> wait for clusterStatus to normal
-        LOG.debug("auto start wait cluster {} status {}", clusterName, 
clusterStatus);
-        if (Cloud.ClusterStatus.valueOf(clusterStatus) != 
Cloud.ClusterStatus.NORMAL) {
+        // notify ms -> wait for clusterStatus to normal
+        LOG.debug("auto start wait cluster {} status {}", resolvedClusterName, 
clusterStatus);
+        if (clusterStatus != Cloud.ClusterStatus.NORMAL) {
             // ATTN: prevent `Automatic Analyzer` daemon threads from pulling 
up clusters
             // FeConstants.INTERNAL_DB_NAME ? see 
StatisticsUtil.buildConnectContext
-            List<String> ignoreDbNameList = 
Arrays.asList(Config.auto_start_ignore_resume_db_names);
-            if (ConnectContext.get() != null && 
ignoreDbNameList.contains(ConnectContext.get().getDatabase())) {
+            ConnectContext ctx = ConnectContext.get();
+            if (shouldIgnoreAutoStart(ctx)) {
                 LOG.warn("auto start daemon thread db {}, not resume cluster 
{}-{}",
-                        ConnectContext.get().getDatabase(), clusterName, 
clusterStatus);
+                        ctx.getDatabase(), resolvedClusterName, clusterStatus);
                 return null;
             }
-            Cloud.AlterClusterRequest.Builder builder = 
Cloud.AlterClusterRequest.newBuilder();
-            builder.setCloudUniqueId(Config.cloud_unique_id);
-            builder.setRequestIp(FrontendOptions.getLocalHostAddressCached());
-            
builder.setOp(Cloud.AlterClusterRequest.Operation.SET_CLUSTER_STATUS);
+            notifyMetaServiceToResumeCluster(resolvedClusterName);
+        }
+        // wait 5 mins
+        int retryTimes = Config.auto_start_wait_to_resume_times < 0 ? 300 : 
Config.auto_start_wait_to_resume_times;
+        String finalClusterName = resolvedClusterName;
+        String initialClusterStatus = clusterStatusStr;
+        withTemporaryNereidsTimeout(() -> {
+            waitForClusterToResume(finalClusterName, retryTimes, 
initialClusterStatus);
+        });
+        return resolvedClusterName;
+    }
 
-            ClusterPB.Builder clusterBuilder = ClusterPB.newBuilder();
-            clusterBuilder.setClusterId(getCloudClusterIdByName(clusterName));
-            clusterBuilder.setClusterStatus(Cloud.ClusterStatus.TO_RESUME);
-            builder.setCluster(clusterBuilder);
+    private Cloud.ClusterStatus parseClusterStatusOrNull(String 
clusterStatusStr, String clusterName) {
+        if (Strings.isNullOrEmpty(clusterStatusStr)) {
+            // for cluster rename or cluster dropped
+            LOG.warn("cant find clusterStatus in fe, clusterName {}", 
clusterName);
+            return null;
+        }
+        try {
+            return Cloud.ClusterStatus.valueOf(clusterStatusStr);
+        } catch (Throwable t) {
+            LOG.warn("invalid clusterStatus {} for clusterName {}", 
clusterStatusStr, clusterName, t);
+            return null;
+        }
+    }
 
-            Cloud.AlterClusterResponse response;
-            try {
-                Cloud.AlterClusterRequest request = builder.build();
-                response = 
MetaServiceProxy.getInstance().alterCluster(request);
-                LOG.info("alter cluster, request: {}, response: {}", request, 
response);
-                if (response.getStatus().getCode() != 
Cloud.MetaServiceCode.OK) {
-                    LOG.warn("notify to resume cluster not ok, cluster {}, 
response: {}", clusterName, response);
-                }
-                LOG.info("notify to resume cluster {}, response: {} ", 
clusterName, response);
-            } catch (RpcException e) {
-                LOG.warn("failed to notify to resume cluster {}", clusterName, 
e);
-                throw new DdlException("notify to resume cluster not ok");
+    private boolean shouldIgnoreAutoStart(ConnectContext ctx) {
+        if (ctx == null) {
+            return false;
+        }
+        String dbName = ctx.getDatabase();
+        if (Strings.isNullOrEmpty(dbName) || 
Config.auto_start_ignore_resume_db_names == null) {
+            return false;
+        }
+        for (String ignore : Config.auto_start_ignore_resume_db_names) {
+            if (dbName.equals(ignore)) {
+                return true;
             }
         }
-        // wait 5 mins
-        int retryTimes = Config.auto_start_wait_to_resume_times < 0 ? 300 : 
Config.auto_start_wait_to_resume_times;
+        return false;
+    }
+
+    private void notifyMetaServiceToResumeCluster(String clusterName) throws 
DdlException {
+        Cloud.AlterClusterRequest.Builder builder = 
Cloud.AlterClusterRequest.newBuilder();
+        builder.setCloudUniqueId(Config.cloud_unique_id);
+        builder.setRequestIp(FrontendOptions.getLocalHostAddressCached());
+        builder.setOp(Cloud.AlterClusterRequest.Operation.SET_CLUSTER_STATUS);
+
+        ClusterPB.Builder clusterBuilder = ClusterPB.newBuilder();
+        clusterBuilder.setClusterId(getCloudClusterIdByName(clusterName));
+        clusterBuilder.setClusterStatus(Cloud.ClusterStatus.TO_RESUME);
+        builder.setCluster(clusterBuilder);
+
+        try {
+            Cloud.AlterClusterRequest request = builder.build();
+            Cloud.AlterClusterResponse response = 
MetaServiceProxy.getInstance().alterCluster(request);
+            LOG.info("alter cluster, request: {}, response: {}", request, 
response);
+            if (response.getStatus().getCode() != Cloud.MetaServiceCode.OK) {
+                LOG.warn("notify to resume cluster not ok, cluster {}, 
response: {}", clusterName, response);
+            }
+            LOG.info("notify to resume cluster {}, response: {} ", 
clusterName, response);
+        } catch (RpcException e) {
+            LOG.warn("failed to notify to resume cluster {}", clusterName, e);
+            throw new DdlException("notify to resume cluster not ok");
+        }
+    }
+
+    /**
+     * Wait for cluster to resume to NORMAL status with alive backends.
+     * @param clusterName the name of the cluster
+     * @param retryTimes maximum number of retry attempts
+     * @param initialClusterStatus the initial cluster status
+     * @throws DdlException if the cluster fails to resume within the retry 
limit
+     */
+    private void waitForClusterToResume(String clusterName, int retryTimes, 
String initialClusterStatus)
+            throws DdlException {
         int retryTime = 0;
         StopWatch stopWatch = new StopWatch();
         stopWatch.start();
         boolean hasAutoStart = false;
         boolean existAliveBe = true;
-        while 
((!String.valueOf(Cloud.ClusterStatus.NORMAL).equals(clusterStatus) || 
!existAliveBe)
+        String clusterStatusStr = initialClusterStatus;
+        Cloud.ClusterStatus clusterStatus = 
parseClusterStatusOrNull(clusterStatusStr, clusterName);
+        Cloud.ClusterStatus lastLoggedStatus = clusterStatus;
+        boolean lastLoggedExistAliveBe = existAliveBe;
+
+        while ((clusterStatus != Cloud.ClusterStatus.NORMAL || !existAliveBe)
             && retryTime < retryTimes) {
             hasAutoStart = true;
             ++retryTime;
             // sleep random millis [0.5, 1] s
-            int randomSeconds =  500 + (int) (Math.random() * (1000 - 500));
-            LOG.info("resume cluster {} retry times {}, wait randomMillis: {}, 
current status: {}",
-                    clusterName, retryTime, randomSeconds, clusterStatus);
+            int sleepMs = ThreadLocalRandom.current().nextInt(500, 1001);
             try {
                 if (retryTime > retryTimes / 2) {
                     // sleep random millis [1, 1.5] s
-                    randomSeconds =  1000 + (int) (Math.random() * (1000 - 
500));
+                    sleepMs = ThreadLocalRandom.current().nextInt(1000, 1501);
                 }
-                Thread.sleep(randomSeconds);
+                Thread.sleep(sleepMs);
             } catch (InterruptedException e) {
                 LOG.info("change cluster sleep wait InterruptedException: ", 
e);
             }
-            clusterStatus = getCloudStatusByName(clusterName);
+            clusterStatusStr = getCloudStatusByName(clusterName);
+            clusterStatus = parseClusterStatusOrNull(clusterStatusStr, 
clusterName);
             // Check that the bes node in the cluster have at least one alive
             existAliveBe = 
getBackendsByClusterName(clusterName).stream().anyMatch(Backend::isAlive);
+
+            // Reduce log spam: log when status changes / alive-be changes / 
every 10 retries
+            boolean statusChanged = lastLoggedStatus != clusterStatus;
+            boolean aliveChanged = lastLoggedExistAliveBe != existAliveBe;
+            boolean periodicLog = (retryTime % 10 == 0);
+            if (statusChanged || aliveChanged || periodicLog) {

Review Comment:
   **Good improvement**: Logging only when the status changes, when the alive-BE state changes, or on every 10th retry should significantly reduce log noise during auto-start operations.
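
   A minimal sketch of that throttling rule in isolation. The diff is cut off before the body of the `if`, so updating the remembered values whenever a line is logged is an assumption here; `ResumeLogThrottle` is a hypothetical name, and the status is a plain `String` instead of `Cloud.ClusterStatus`.

   ```java
   import java.util.Objects;

   // Hypothetical helper, not part of the PR.
   final class ResumeLogThrottle {
       private String lastLoggedStatus;
       private boolean lastLoggedAliveBe;

       ResumeLogThrottle(String initialStatus, boolean initialAliveBe) {
           this.lastLoggedStatus = initialStatus;
           this.lastLoggedAliveBe = initialAliveBe;
       }

       /** Returns true if this retry should be logged, and remembers what was logged. */
       boolean shouldLog(int retryTime, String status, boolean aliveBe) {
           boolean statusChanged = !Objects.equals(lastLoggedStatus, status);
           boolean aliveChanged = lastLoggedAliveBe != aliveBe;
           boolean periodicLog = retryTime % 10 == 0;
           if (statusChanged || aliveChanged || periodicLog) {
               // Assumption: the real loop updates these after logging.
               lastLoggedStatus = status;
               lastLoggedAliveBe = aliveBe;
               return true;
           }
           return false;
       }
   }
   ```

   With a 300-retry budget and no state changes, this rule would emit 30 periodic lines instead of 300.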



##########
fe/fe-core/src/main/java/org/apache/doris/common/profile/SummaryProfile.java:
##########
@@ -145,10 +145,19 @@ public class SummaryProfile {
     public static final String SPLITS_ASSIGNMENT_WEIGHT = "Splits Assignment Weight";
     public static final String ICEBERG_SCAN_METRICS = "Iceberg Scan Metrics";
     public static final String PAIMON_SCAN_METRICS = "Paimon Scan Metrics";
+    private boolean isWarmUp = false;

Review Comment:
   **Minor**: The field is named `isWarmUp` (capital U), but the accessors are `isWarmup()` and `setWarmup()`. Consider renaming the field to `warmup`, the usual name for a boolean field behind an `isWarmup()` getter, or at least to `isWarmup`, so that the field and its accessors agree.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

