[GitHub] spark pull request #20534: [SPARK-23319][TESTS][BRANCH-2.3] Explicitly speci...

HyukjinKwon Wed, 07 Feb 2018 06:36:08 -0800

GitHub user HyukjinKwon opened a pull request:

    https://github.com/apache/spark/pull/20534


    [SPARK-23319][TESTS][BRANCH-2.3] Explicitly specify Pandas and PyArrow versions in PySpark tests (to skip or test)

    This PR backports https://github.com/apache/spark/pull/20487 to branch-2.3.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark PR_TOOL_PICK_PR_20487_BRANCH-2.3

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20534.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20534
    
----
commit ff9ba5eb840bcd843c5201e23589e8cbb5009c53
Author: hyukjinkwon <gurwls223@...>
Date:   2018-02-07T14:28:10Z

    [SPARK-23319][TESTS] Explicitly specify Pandas and PyArrow versions in PySpark tests (to skip or test)
    
    This PR proposes to explicitly specify the Pandas and PyArrow versions in PySpark tests, so that each test is either run or skipped depending on the installed versions.
    
    We declared the extra dependencies:
    
    https://github.com/apache/spark/blob/b8bfce51abf28c66ba1fc67b0f25fe1617c81025/python/setup.py#L204
    
    In case of PyArrow:
    
    Currently we only check whether PyArrow is installed, without checking its version, so the tests already fail to run when an older PyArrow is present. For example, if PyArrow 0.7.0 is installed:
    
    ```
    ======================================================================
    ERROR: test_vectorized_udf_wrong_return_type (pyspark.sql.tests.ScalarPandasUDF)
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "/.../spark/python/pyspark/sql/tests.py", line 4019, in test_vectorized_udf_wrong_return_type
        f = pandas_udf(lambda x: x * 1.0, MapType(LongType(), LongType()))
      File "/.../spark/python/pyspark/sql/functions.py", line 2309, in pandas_udf
        return _create_udf(f=f, returnType=return_type, evalType=eval_type)
      File "/.../spark/python/pyspark/sql/udf.py", line 47, in _create_udf
        require_minimum_pyarrow_version()
      File "/.../spark/python/pyspark/sql/utils.py", line 132, in require_minimum_pyarrow_version
        "however, your version was %s." % pyarrow.__version__)
    ImportError: pyarrow >= 0.8.0 must be installed on calling Python process; however, your version was 0.7.0.
    
    ----------------------------------------------------------------------
    Ran 33 tests in 8.098s
    
    FAILED (errors=33)
    ```
    
    In case of Pandas:
    
    There are a few tests for old Pandas that ran only when the installed Pandas version was lower than required. I rewrote them to run both when the Pandas version is lower and when Pandas is missing.
    
    Manually tested by modifying the version conditions (for example, raising the required version above the installed one):
    
    ```
    test_createDataFrame_column_name_encoding (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 1.19.2 must be installed; however, your version was 0.19.2.'
    test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 1.19.2 must be installed; however, your version was 0.19.2.'
    test_createDataFrame_respect_session_timezone (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 1.19.2 must be installed; however, your version was 0.19.2.'
    ```
    
    ```
    test_createDataFrame_column_name_encoding (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 0.19.2 must be installed; however, it was not found.'
    test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 0.19.2 must be installed; however, it was not found.'
    test_createDataFrame_respect_session_timezone (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 0.19.2 must be installed; however, it was not found.'
    ```
    
    ```
    test_createDataFrame_column_name_encoding (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 1.8.0 must be installed; however, your version was 0.8.0.'
    test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 1.8.0 must be installed; however, your version was 0.8.0.'
    test_createDataFrame_respect_session_timezone (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 1.8.0 must be installed; however, your version was 0.8.0.'
    ```
    
    ```
    test_createDataFrame_column_name_encoding (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 0.8.0 must be installed; however, it was not found.'
    test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 0.8.0 must be installed; however, it was not found.'
    test_createDataFrame_respect_session_timezone (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 0.8.0 must be installed; however, it was not found.'
    ```
    
    Author: hyukjinkwon <[email protected]>
    
    Closes #20487 from HyukjinKwon/pyarrow-pandas-skip.
    
    (cherry picked from commit 71cfba04aeec5ae9b85a507b13996e80f8750edc)
    
    Signed-off-by: hyukjinkwon <[email protected]>

----

