Re: [PR] [improvement](statistics) Collect hot value during full statistics analysis [doris]

via GitHub Tue, 12 May 2026 02:15:31 -0700


yujun777 commented on code in PR #62435:
URL: https://github.com/apache/doris/pull/62435#discussion_r3225170397



##########
regression-test/suites/statistics/test_hot_value.groovy:
##########
@@ -81,28 +81,31 @@ suite("test_hot_value") {
     wait_row_count_reported("test_hot_value", "test1", 0, 4, "10000")
     wait_row_count_reported("test_hot_value", "test2", 0, 4, "10000")
     sql """analyze table test1 with sync"""
-    logger.info("1. memo plan ")
-    explain {
-        sql("memo plan select * from test1")
-        contains "hotValues=(null)"
-    }
     def result = sql """show column stats test1(key1)"""
     assertEquals(1, result.size())
     assertEquals("10000.0", result[0][2])
-    assertEquals("null", result[0][17])
+    assertTrue(result[0][17] != "null")
     result = sql """show column stats test1(value1)"""
     logger.info("0. result " + result)
     assertEquals(1, result.size())
     assertEquals("10000.0", result[0][2])
-    assertEquals("null", result[0][17])
+    String[] fullHotValues = result[0][17].split(";")

Review Comment:
   Added an end-to-end memo plan assertion in test_hot_value.groovy after full 
analyze, so the regression now verifies that collected hot values are visible 
in optimizer memo statistics and not just stored in the stats table.
   
   Verified with ./run-regression-test.sh --run -d statistics -s test_hot_value.



##########
regression-test/suites/statistics/test_hot_value.groovy:
##########
@@ -81,28 +81,31 @@ suite("test_hot_value") {
     wait_row_count_reported("test_hot_value", "test1", 0, 4, "10000")
     wait_row_count_reported("test_hot_value", "test2", 0, 4, "10000")
     sql """analyze table test1 with sync"""
-    logger.info("1. memo plan ")
-    explain {
-        sql("memo plan select * from test1")
-        contains "hotValues=(null)"
-    }
     def result = sql """show column stats test1(key1)"""
     assertEquals(1, result.size())
     assertEquals("10000.0", result[0][2])
-    assertEquals("null", result[0][17])
+    assertTrue(result[0][17] != "null")
     result = sql """show column stats test1(value1)"""
     logger.info("0. result " + result)
     assertEquals(1, result.size())
     assertEquals("10000.0", result[0][2])
-    assertEquals("null", result[0][17])
+    String[] fullHotValues = result[0][17].split(";")
+    logger.info("0.1 fullHotValues " + result[0][17])
+    assertEquals(2, fullHotValues.length)
+    assertTrue(fullHotValues[0].trim() == "'1':0.5" || fullHotValues[0].trim() 
== "'0':0.5")
+    assertTrue(fullHotValues[1].trim() == "'1':0.5" || fullHotValues[1].trim() 
== "'0':0.5")
     result = sql """show column cached stats test1(key1)"""
     assertEquals(1, result.size())
     assertEquals("10000.0", result[0][2])
+    // cached stats filter hot values by threshold; key1 with 10000 unique 
values has ratio~0, so cached hot_value is null
     assertEquals("null", result[0][17])

Review Comment:
   Fixed in b1842dc22b0. The cached-stats assertion now only checks that 
hot_value is present instead of expecting null, which matches the FE 
deserialization/cache path.
   
   Verified with ./run-regression-test.sh --run -d statistics -s test_hot_value.



##########
fe/fe-core/src/main/java/org/apache/doris/statistics/BaseAnalysisTask.java:
##########
@@ -87,10 +90,20 @@ public abstract class BaseAnalysisTask {
             +     "SUBSTRING(CAST(MIN(`${colName}`) AS STRING), 1, 1024) AS 
`min`, "
             +     "SUBSTRING(CAST(MAX(`${colName}`) AS STRING), 1, 1024) AS 
`max`, "
             +     "${dataSizeFunction} AS `data_size`, "
-            +     "NOW() AS `update_time`, "
-            +     "null as `hot_value` "
-            + "FROM (SELECT `${colName}`${lengthAssert} "
-            +     "FROM `${catalogName}`.`${dbName}`.`${tblName}` ${index}) 
__lc_t";
+            +     "NOW() "
+            + "FROM cte1), "
+            + "cte3 AS ("

Review Comment:
   Fixed in ae8282760b8. FULL_ANALYZE_TEMPLATE now mirrors the sample templates 
by wrapping the cte3 GROUP_CONCAT result with IFNULL(..., ), so empty/all-NULL 
columns persist an empty hot_value string instead of SQL NULL. I also updated 
the related FE UT expectations in OlapAnalysisTaskTest and HMSAnalysisTaskTest.
   
   Verified with:
   - mvn -Pcoverage test -Dcheckstyle.skip=true -DfailIfNoTests=false 
-Dmaven.build.cache.enabled=false 
-Dtest=org.apache.doris.statistics.OlapAnalysisTaskTest,org.apache.doris.statistics.HMSAnalysisTaskTest
   - ./run-regression-test.sh --run -d statistics -s test_hot_value



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [improvement](statistics) Collect hot value during full statistics analysis [doris]

Reply via email to