This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 5cd50eb6d3c7 [MINOR][PYTHON][TESTS] De-flake unit tests of `collect_set` and `collect_list` in `test_connect_function`
5cd50eb6d3c7 is described below

commit 5cd50eb6d3c7f05280749e022b05b24de51b274e
Author: Ruifeng Zheng <ruife...@apache.org>
AuthorDate: Wed Jul 16 16:54:42 2025 +0900

    [MINOR][PYTHON][TESTS] De-flake unit tests of `collect_set` and `collect_list` in `test_connect_function`

    ### What changes were proposed in this pull request?
    De-flake unit tests of `collect_set` and `collect_list`, by sorting the output arrays

    ### Why are the changes needed?
    the tests occasionally fail with
    ```
    AssertionError: DataFrame.iloc[:, 0] (column name="r1") are different

    DataFrame.iloc[:, 0] (column name="r1") values are different (100.0 %)
    [index]: [0]
    [left]:  [[nan, 2.1, 0.5]]
    [right]: [[2.1, nan, 0.5]]
    At positional index 0, first diff: [nan 2.1 0.5] != [2.1 nan 0.5]
    ```

    ### Does this PR introduce _any_ user-facing change?
    no, test-only

    ### How was this patch tested?
    CI

    ### Was this patch authored or co-authored using generative AI tooling?
    No

    Closes #51511 from zhengruifeng/fix_agg_test_2.

    Authored-by: Ruifeng Zheng <ruife...@apache.org>
    Signed-off-by: Hyukjin Kwon <gurwls...@apache.org>
---
 .../sql/tests/connect/test_connect_function.py      | 21 +++++++++++++++++++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/python/pyspark/sql/tests/connect/test_connect_function.py b/python/pyspark/sql/tests/connect/test_connect_function.py
index 56c169ecb029..b906f5c5cef4 100644
--- a/python/pyspark/sql/tests/connect/test_connect_function.py
+++ b/python/pyspark/sql/tests/connect/test_connect_function.py
@@ -551,8 +551,6 @@ class SparkConnectFunctionTests(ReusedMixedTestCase, PandasOnSparkTestUtils):
             (CF.approx_count_distinct, SF.approx_count_distinct),
             (CF.approxCountDistinct, SF.approxCountDistinct),
             (CF.avg, SF.avg),
-            (CF.collect_list, SF.collect_list),
-            (CF.collect_set, SF.collect_set),
             (CF.listagg, SF.listagg),
             (CF.listagg_distinct, SF.listagg_distinct),
             (CF.string_agg, SF.string_agg),
@@ -589,6 +587,25 @@ class SparkConnectFunctionTests(ReusedMixedTestCase, PandasOnSparkTestUtils):
                 check_exact=False,
             )
 
+        for cfunc, sfunc in [
+            (CF.collect_list, SF.collect_list),
+            (CF.collect_set, SF.collect_set),
+        ]:
+            self.assert_eq(
+                cdf.select(CF.sort_array(cfunc("b")), CF.sort_array(cfunc(cdf.c))).toPandas(),
+                sdf.select(SF.sort_array(sfunc("b")), SF.sort_array(sfunc(sdf.c))).toPandas(),
+                check_exact=False,
+            )
+            self.assert_eq(
+                cdf.groupBy("a")
+                .agg(CF.sort_array(cfunc("b")), CF.sort_array(cfunc(cdf.c)))
+                .toPandas(),
+                sdf.groupBy("a")
+                .agg(SF.sort_array(sfunc("b")), SF.sort_array(sfunc(sdf.c)))
+                .toPandas(),
+                check_exact=False,
+            )
+
         for cfunc, sfunc in [
             (CF.corr, SF.corr),
             (CF.covar_pop, SF.covar_pop),
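
Illustrative note on the fix: the failure quoted in the commit message shows two arrays with the same elements in a different order, and the patch makes the comparison order-independent by wrapping each aggregate in `sort_array`. The sketch below reproduces that idea in plain Python, without a Spark session. The helpers `sort_like_sort_array` and `arrays_equal` are hypothetical names introduced here and are not part of the patch. The ordering rule (nulls first, NaN last in ascending order) is an assumption about how `sort_array` behaves.

```python
import math


def sort_like_sort_array(values):
    """Sort ascending with None first and NaN last.

    Assumption: this mirrors the ascending order of Spark's sort_array
    (nulls first, NaN greater than every other number).
    """
    def key(v):
        if v is None:
            return (0, 0.0)
        if isinstance(v, float) and math.isnan(v):
            return (2, 0.0)
        return (1, v)

    return sorted(values, key=key)


def arrays_equal(left, right):
    """Element-wise equality that treats NaN as equal to NaN."""
    if len(left) != len(right):
        return False
    for a, b in zip(left, right):
        both_nan = (
            isinstance(a, float)
            and isinstance(b, float)
            and math.isnan(a)
            and math.isnan(b)
        )
        if not (both_nan or a == b):
            return False
    return True


# The two orderings taken from the failure message in the commit description.
left = [float("nan"), 2.1, 0.5]
right = [2.1, float("nan"), 0.5]

# Same elements, different order: a positional comparison fails.
print(arrays_equal(left, right))

# After canonical sorting on both sides, the comparison no longer depends on order.
print(arrays_equal(sort_like_sort_array(left), sort_like_sort_array(right)))
```

A plain `sorted()` call is avoided on purpose: NaN compares false against every value, so sorting a list that contains NaN without an explicit key can leave elements in an input-dependent order, which would reintroduce the flakiness the patch removes.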