(spark) branch master updated: [MINOR][PYTHON][TESTS] De-flake unit tests of `collect_set` and `collect_list` in `test_connect_function`

gurwls223 Wed, 16 Jul 2025 00:54:58 -0700

This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git



The following commit(s) were added to refs/heads/master by this push:
     new 5cd50eb6d3c7 [MINOR][PYTHON][TESTS] De-flake unit tests of `collect_set` and `collect_list` in `test_connect_function`
5cd50eb6d3c7 is described below

commit 5cd50eb6d3c7f05280749e022b05b24de51b274e
Author: Ruifeng Zheng <ruife...@apache.org>
AuthorDate: Wed Jul 16 16:54:42 2025 +0900

    [MINOR][PYTHON][TESTS] De-flake unit tests of `collect_set` and `collect_list` in `test_connect_function`
    
    ### What changes were proposed in this pull request?
    De-flake the unit tests of `collect_set` and `collect_list` by sorting the output arrays before they are compared.
    
    ### Why are the changes needed?
    The tests occasionally fail because the element order of the collected arrays is not deterministic:
    ```
    AssertionError: DataFrame.iloc[:, 0] (column name="r1") are different
    DataFrame.iloc[:, 0] (column name="r1") values are different (100.0 %)
    [index]: [0]
    [left]:  [[nan, 2.1, 0.5]]
    [right]: [[2.1, nan, 0.5]]
    At positional index 0, first diff: [nan 2.1 0.5] != [2.1 nan 0.5]
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    No, test-only.
    
    ### How was this patch tested?
    CI
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No
    
    Closes #51511 from zhengruifeng/fix_agg_test_2.
    
    Authored-by: Ruifeng Zheng <ruife...@apache.org>
    Signed-off-by: Hyukjin Kwon <gurwls...@apache.org>
---
 .../sql/tests/connect/test_connect_function.py      | 21 +++++++++++++++++++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/python/pyspark/sql/tests/connect/test_connect_function.py b/python/pyspark/sql/tests/connect/test_connect_function.py
index 56c169ecb029..b906f5c5cef4 100644
--- a/python/pyspark/sql/tests/connect/test_connect_function.py
+++ b/python/pyspark/sql/tests/connect/test_connect_function.py
@@ -551,8 +551,6 @@ class SparkConnectFunctionTests(ReusedMixedTestCase, PandasOnSparkTestUtils):
             (CF.approx_count_distinct, SF.approx_count_distinct),
             (CF.approxCountDistinct, SF.approxCountDistinct),
             (CF.avg, SF.avg),
-            (CF.collect_list, SF.collect_list),
-            (CF.collect_set, SF.collect_set),
             (CF.listagg, SF.listagg),
             (CF.listagg_distinct, SF.listagg_distinct),
             (CF.string_agg, SF.string_agg),
@@ -589,6 +587,25 @@ class SparkConnectFunctionTests(ReusedMixedTestCase, PandasOnSparkTestUtils):
                 check_exact=False,
             )
 
+        for cfunc, sfunc in [
+            (CF.collect_list, SF.collect_list),
+            (CF.collect_set, SF.collect_set),
+        ]:
+            self.assert_eq(
+                cdf.select(CF.sort_array(cfunc("b")), CF.sort_array(cfunc(cdf.c))).toPandas(),
+                sdf.select(SF.sort_array(sfunc("b")), SF.sort_array(sfunc(sdf.c))).toPandas(),
+                check_exact=False,
+            )
+            self.assert_eq(
+                cdf.groupBy("a")
+                .agg(CF.sort_array(cfunc("b")), CF.sort_array(cfunc(cdf.c)))
+                .toPandas(),
+                sdf.groupBy("a")
+                .agg(SF.sort_array(sfunc("b")), SF.sort_array(sfunc(sdf.c)))
+                .toPandas(),
+                check_exact=False,
+            )
+
         for cfunc, sfunc in [
             (CF.corr, SF.corr),
             (CF.covar_pop, SF.covar_pop),
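
Note on the fix: the failure quoted in the commit message shows two arrays with the same elements in a different order, and the patch makes the comparison order-insensitive by sorting both sides. The following is a minimal pure-Python sketch of that idea only. It does not use Spark, the helper names `sort_with_nan_first` and `same_elements` are hypothetical, and placing NaN first is an arbitrary choice for this sketch rather than a statement about how `sort_array` orders values.

```python
import math


def sort_with_nan_first(values):
    """Return a sorted copy of `values` with NaN entries placed first.

    NaN compares false against every number, so plain sorted() gives an
    unreliable order. The key sorts on (is_not_nan, value) and substitutes
    0.0 for NaN so that NaN itself is never compared.
    """
    return sorted(
        values,
        key=lambda v: (not math.isnan(v), 0.0 if math.isnan(v) else v),
    )


def same_elements(left, right):
    """Order-insensitive comparison that treats NaN as equal to NaN."""
    a = sort_with_nan_first(left)
    b = sort_with_nan_first(right)
    return len(a) == len(b) and all(
        (math.isnan(x) and math.isnan(y)) or x == y for x, y in zip(a, b)
    )


# The two arrays from the assertion error in the commit message.
left = [float("nan"), 2.1, 0.5]
right = [2.1, float("nan"), 0.5]

print(left == right)               # positional comparison fails
print(same_elements(left, right))  # comparison after sorting succeeds
```

Sorting both sides keeps the check strict about which elements are present while ignoring the order in which they were collected.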


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
