This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
     new 6c4747cdfb02 [SPARK-46735][PYTHON][TESTS] 
`pyspark.sql.tests.test_group` should skip Pandas/PyArrow tests if not available
6c4747cdfb02 is described below

commit 6c4747cdfb02b5ff7197f2e8b55a79a4ac082531
Author: Dongjoon Hyun <[email protected]>
AuthorDate: Tue Jan 16 14:01:07 2024 -0800

    [SPARK-46735][PYTHON][TESTS] `pyspark.sql.tests.test_group` should skip 
Pandas/PyArrow tests if not available
    
    ### What changes were proposed in this pull request?
    
    This PR aims to skip `Pandas`-related or `PyArrow`-related tests in 
`pyspark.sql.tests.test_group` if the corresponding packages are not installed.
    
    This regression was introduced by
    - #44322
    - #42767
    
    ### Why are the changes needed?
    
    Since `Pandas` and `PyArrow` are optional dependencies, the tests should be 
skipped instead of failing when these packages are missing.
    - https://github.com/apache/spark/actions/runs/7543495430/job/20534809039
    
    ```
    ======================================================================
    ERROR: test_agg_func (pyspark.sql.tests.test_group.GroupTests)
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File 
"/Users/dongjoon/APACHE/spark-merge/python/pyspark/sql/pandas/utils.py", line 
28, in require_minimum_pandas_version
        import pandas
    ModuleNotFoundError: No module named 'pandas'
    ```
    
    ```
    ======================================================================
    ERROR: test_agg_func (pyspark.sql.tests.test_group.GroupTests)
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "/__w/spark/spark/python/pyspark/sql/pandas/utils.py", line 61, in 
require_minimum_pyarrow_version
        import pyarrow
    ModuleNotFoundError: No module named 'pyarrow'
    ```
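
    For readers unfamiliar with the pattern, the fix guards each affected test with `unittest.skipIf`, the same mechanism used throughout the PySpark test suite. A minimal standalone sketch of the idea (the probe and names here are illustrative; the actual flags live in `pyspark.testing.sqlutils`):

    ```python
    import unittest

    # Probe for the optional dependency once at import time. This mirrors
    # the have_pandas / pandas_requirement_message pair that the patch
    # imports from pyspark.testing.sqlutils.
    try:
        import pandas  # noqa: F401

        have_pandas = True
        pandas_requirement_message = None
    except ImportError:
        have_pandas = False
        pandas_requirement_message = "Pandas must be installed"


    class ExampleTests(unittest.TestCase):
        # When pandas is absent, the test is reported as skipped with the
        # requirement message instead of erroring with ModuleNotFoundError.
        @unittest.skipIf(not have_pandas, pandas_requirement_message)
        def test_needs_pandas(self):
            import pandas as pd

            self.assertEqual(len(pd.DataFrame({"a": [1]})), 1)


    if __name__ == "__main__":
        unittest.main()
    ```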
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    - Manually, with a Python installation that does not have Pandas.
    ```
    $ python/run-tests.py --testnames pyspark.sql.tests.test_group
    Running PySpark tests. Output is in 
/Users/dongjoon/APACHE/spark-merge/python/unit-tests.log
    Will test against the following Python executables: ['python3.9', 'pypy3']
    Will test the following Python tests: ['pyspark.sql.tests.test_group']
    python3.9 python_implementation is CPython
    python3.9 version is: Python 3.9.18
    pypy3 python_implementation is PyPy
    pypy3 version is: Python 3.10.13 (f1607341da97ff5a1e93430b6e8c4af0ad1aa019, 
Sep 28 2023, 20:47:55)
    [PyPy 7.3.13 with GCC Apple LLVM 13.1.6 (clang-1316.0.21.2.5)]
    Starting test(python3.9): pyspark.sql.tests.test_group (temp output: 
/Users/dongjoon/APACHE/spark-merge/python/target/ac9269b6-f0df-4d06-88b8-e5e710202b60/python3.9__pyspark.sql.tests.test_group__9zjp5i4z.log)
    Starting test(pypy3): pyspark.sql.tests.test_group (temp output: 
/Users/dongjoon/APACHE/spark-merge/python/target/cab6ebed-e49f-4d86-80db-0dc3928079e3/pypy3__pyspark.sql.tests.test_group__thw6hily.log)
    Finished test(pypy3): pyspark.sql.tests.test_group (6s) ... 3 tests were 
skipped
    Finished test(python3.9): pyspark.sql.tests.test_group (7s) ... 3 tests 
were skipped
    Tests passed in 7 seconds
    
    Skipped tests in pyspark.sql.tests.test_group with pypy3:
        test_agg_func (pyspark.sql.tests.test_group.GroupTests) ... skipped 
'[PACKAGE_NOT_INSTALLED] Pandas >= 1.4.4 must be installed; however, it was not 
found.'
        test_group_by_ordinal (pyspark.sql.tests.test_group.GroupTests) ... 
skipped '[PACKAGE_NOT_INSTALLED] Pandas >= 1.4.4 must be installed; however, it 
was not found.'
        test_order_by_ordinal (pyspark.sql.tests.test_group.GroupTests) ... 
skipped '[PACKAGE_NOT_INSTALLED] Pandas >= 1.4.4 must be installed; however, it 
was not found.'
    
    Skipped tests in pyspark.sql.tests.test_group with python3.9:
          test_agg_func (pyspark.sql.tests.test_group.GroupTests) ... SKIP 
(0.000s)
          test_group_by_ordinal (pyspark.sql.tests.test_group.GroupTests) ... 
SKIP (0.000s)
          test_order_by_ordinal (pyspark.sql.tests.test_group.GroupTests) ... 
SKIP (0.000s)
    ```
    
    - Manually, with a Python installation that does not have PyArrow.
    ```
    $ python/run-tests.py --testnames pyspark.sql.tests.test_group
    Running PySpark tests. Output is in 
/Users/dongjoon/APACHE/spark-merge/python/unit-tests.log
    Will test against the following Python executables: ['python3.9', 'pypy3']
    Will test the following Python tests: ['pyspark.sql.tests.test_group']
    python3.9 python_implementation is CPython
    python3.9 version is: Python 3.9.18
    pypy3 python_implementation is PyPy
    pypy3 version is: Python 3.10.13 (f1607341da97ff5a1e93430b6e8c4af0ad1aa019, 
Sep 28 2023, 20:47:55)
    [PyPy 7.3.13 with GCC Apple LLVM 13.1.6 (clang-1316.0.21.2.5)]
    Starting test(pypy3): pyspark.sql.tests.test_group (temp output: 
/Users/dongjoon/APACHE/spark-merge/python/target/7f1a665e-a679-467c-8ab4-a4532e0b2300/pypy3__pyspark.sql.tests.test_group__i67erhb4.log)
    Starting test(python3.9): pyspark.sql.tests.test_group (temp output: 
/Users/dongjoon/APACHE/spark-merge/python/target/47b90765-8ad7-4da0-aa7b-c12cd266847e/python3.9__pyspark.sql.tests.test_group__190hx0tm.log)
    Finished test(python3.9): pyspark.sql.tests.test_group (6s) ... 3 tests 
were skipped
    Finished test(pypy3): pyspark.sql.tests.test_group (7s) ... 3 tests were 
skipped
    Tests passed in 7 seconds
    
    Skipped tests in pyspark.sql.tests.test_group with pypy3:
        test_agg_func (pyspark.sql.tests.test_group.GroupTests) ... skipped 
'[PACKAGE_NOT_INSTALLED] PyArrow >= 4.0.0 must be installed; however, it was 
not found.'
        test_group_by_ordinal (pyspark.sql.tests.test_group.GroupTests) ... 
skipped '[PACKAGE_NOT_INSTALLED] PyArrow >= 4.0.0 must be installed; however, 
it was not found.'
        test_order_by_ordinal (pyspark.sql.tests.test_group.GroupTests) ... 
skipped '[PACKAGE_NOT_INSTALLED] PyArrow >= 4.0.0 must be installed; however, 
it was not found.'
    
    Skipped tests in pyspark.sql.tests.test_group with python3.9:
          test_agg_func (pyspark.sql.tests.test_group.GroupTests) ... SKIP 
(0.000s)
          test_group_by_ordinal (pyspark.sql.tests.test_group.GroupTests) ... 
SKIP (0.000s)
          test_order_by_ordinal (pyspark.sql.tests.test_group.GroupTests) ... 
SKIP (0.000s)
    ```

    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes #44759 from dongjoon-hyun/SPARK-46735.
    
    Authored-by: Dongjoon Hyun <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
---
 python/pyspark/sql/tests/test_group.py | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/python/pyspark/sql/tests/test_group.py 
b/python/pyspark/sql/tests/test_group.py
index 6c84bd740171..1a9b7d9d836c 100644
--- a/python/pyspark/sql/tests/test_group.py
+++ b/python/pyspark/sql/tests/test_group.py
@@ -14,14 +14,23 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 #
+import unittest
 
 from pyspark.sql import Row
 from pyspark.sql import functions as sf
-from pyspark.testing.sqlutils import ReusedSQLTestCase
+from pyspark.testing.sqlutils import (
+    ReusedSQLTestCase,
+    have_pandas,
+    have_pyarrow,
+    pandas_requirement_message,
+    pyarrow_requirement_message,
+)
 from pyspark.testing import assertDataFrameEqual, assertSchemaEqual
 
 
 class GroupTestsMixin:
+    @unittest.skipIf(not have_pandas, pandas_requirement_message)  # type: 
ignore
+    @unittest.skipIf(not have_pyarrow, pyarrow_requirement_message)  # type: 
ignore
     def test_agg_func(self):
         data = [Row(key=1, value=10), Row(key=1, value=20), Row(key=1, 
value=30)]
         df = self.spark.createDataFrame(data)
@@ -60,6 +69,8 @@ class GroupTestsMixin:
         # test deprecated countDistinct
         self.assertEqual(100, 
g.agg(functions.countDistinct(df.value)).first()[0])
 
+    @unittest.skipIf(not have_pandas, pandas_requirement_message)  # type: 
ignore
+    @unittest.skipIf(not have_pyarrow, pyarrow_requirement_message)  # type: 
ignore
     def test_group_by_ordinal(self):
         spark = self.spark
         df = spark.createDataFrame(
@@ -119,6 +130,8 @@ class GroupTestsMixin:
             with self.assertRaises(IndexError):
                 df.groupBy(10).agg(sf.sum("b"))
 
+    @unittest.skipIf(not have_pandas, pandas_requirement_message)  # type: 
ignore
+    @unittest.skipIf(not have_pyarrow, pyarrow_requirement_message)  # type: 
ignore
     def test_order_by_ordinal(self):
         spark = self.spark
         df = spark.createDataFrame(


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
