Copilot commented on code in PR #12294:
URL: https://github.com/apache/gluten/pull/12294#discussion_r3464009565
##########
backends-clickhouse/src/test/scala/org/apache/gluten/execution/GlutenClickhouseCountDistinctSuite.scala:
##########
@@ -100,12 +100,13 @@ class GlutenClickhouseCountDistinctSuite extends
GlutenClickHouseWholeStageTrans
}
test("check count distinct with agg fallback") {
- // skewness agg is not supported, will cause fallback
val sql = "select count(distinct(a,b)) , skewness(b) from " +
"values (0, null,1), (0,null,1), (1, 1,1), (2, 2, 1) ,(2,2,2),(3,3,3) as
data(a,b,c)"
- assertThrows[UnsupportedOperationException] {
- spark.sql(sql).show
- }
+ compareResultsAgainstVanillaSpark(sql, true, { _ => })
+
+ val sqlWithKeys = "select a, count(distinct(b)) , skewness(b) from " +
+ "values (0, null,1), (0,null,1), (1, 1,1), (2, 2, 1) ,(2,2,2),(3,3,3) as
data(a,b,c) group by a"
+ compareResultsAgainstVanillaSpark(sqlWithKeys, true, { _ => })
}
Review Comment:
This test now validates that `skewness` can run natively (no fallback), but
it doesn’t cover the new NaN→NULL behavior described in the PR (fewer than 3
non-null values). Adding an explicit query with only 2 non-null rows would
prevent regressions and also makes the test name reflect the new intent (no
fallback anymore).
##########
cpp-ch/local-engine/Parser/aggregate_function_parser/SimpleStatisticsFunctions.cpp:
##########
@@ -63,7 +63,14 @@ class AggregateFunctionParserStddev final : public
AggregateFunctionParser
return func_node;
}
};
+// for skewness
+struct SkewnessNameStruct
+{
+ static constexpr auto spark_name = "skewness";
+ static constexpr auto ch_name = "skewSamp";
+};
+static const
AggregateFunctionParserRegister<AggregateFunctionParserStddev<SkewnessNameStruct>>
registerer_skewness;
Review Comment:
`skewness` is registered using `AggregateFunctionParserStddev`, whose
`convertNodeTypeIfNeeded` unconditionally converts NaN to NULL. This breaks
Spark’s `SQLConf.LEGACY_STATISTICAL_AGGREGATE=true` behavior (Spark expects
NaN, not NULL; see `gluten-ut/.../GlutenSQLAggregateFunctionSuite.scala`).
Consider honoring the Spark conf and only doing NaN→NULL conversion when legacy
mode is off (still keeping a nullable output type).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]