Re: [PR] [CH] Support skewness aggregate function [gluten]

via GitHub Tue, 23 Jun 2026 18:33:30 -0700


Copilot commented on code in PR #12294:
URL: https://github.com/apache/gluten/pull/12294#discussion_r3464009565



##########
backends-clickhouse/src/test/scala/org/apache/gluten/execution/GlutenClickhouseCountDistinctSuite.scala:
##########
@@ -100,12 +100,13 @@ class GlutenClickhouseCountDistinctSuite extends 
GlutenClickHouseWholeStageTrans
   }
 
   test("check count distinct with agg fallback") {
-    // skewness agg is not supported, will cause fallback
     val sql = "select count(distinct(a,b)) , skewness(b) from " +
       "values (0, null,1), (0,null,1), (1, 1,1), (2, 2, 1) ,(2,2,2),(3,3,3) as 
data(a,b,c)"
-    assertThrows[UnsupportedOperationException] {
-      spark.sql(sql).show
-    }
+    compareResultsAgainstVanillaSpark(sql, true, { _ => })
+
+    val sqlWithKeys = "select a, count(distinct(b)) , skewness(b) from " +
+      "values (0, null,1), (0,null,1), (1, 1,1), (2, 2, 1) ,(2,2,2),(3,3,3) as 
data(a,b,c) group by a"
+    compareResultsAgainstVanillaSpark(sqlWithKeys, true, { _ => })
   }

Review Comment:
   This test now validates that `skewness` can run natively (no fallback), but 
it doesn’t cover the new NaN→NULL behavior described in the PR (fewer than 3 
non-null values). Adding an explicit query with only 2 non-null rows would 
prevent regressions and also makes the test name reflect the new intent (no 
fallback anymore).



##########
cpp-ch/local-engine/Parser/aggregate_function_parser/SimpleStatisticsFunctions.cpp:
##########
@@ -63,7 +63,14 @@ class AggregateFunctionParserStddev final : public 
AggregateFunctionParser
         return func_node;
     }
 };
+// for skewness
+struct SkewnessNameStruct
+{
+    static constexpr auto spark_name = "skewness";
+    static constexpr auto ch_name = "skewSamp";
+};
 
+static const 
AggregateFunctionParserRegister<AggregateFunctionParserStddev<SkewnessNameStruct>>
 registerer_skewness;

Review Comment:
   `skewness` is registered using `AggregateFunctionParserStddev`, whose 
`convertNodeTypeIfNeeded` unconditionally converts NaN to NULL. This breaks 
Spark’s `SQLConf.LEGACY_STATISTICAL_AGGREGATE=true` behavior (Spark expects 
NaN, not NULL; see `gluten-ut/.../GlutenSQLAggregateFunctionSuite.scala`). 
Consider honoring the Spark conf and only doing NaN→NULL conversion when legacy 
mode is off (still keeping a nullable output type).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [CH] Support skewness aggregate function [gluten]

Reply via email to