GitHub user HyukjinKwon opened a pull request:

    https://github.com/apache/spark/pull/20567

    [SPARK-23380][PYTHON] Make toPandas fall back to non-Arrow conversion when the schema is not supported by the Arrow optimization

    ## What changes were proposed in this pull request?
    
    This PR proposes to make `toPandas` fall back to the non-Arrow conversion path when the schema is not supported by the Arrow optimization.
    
    ```python
    df = spark.createDataFrame([[{'a': 1}]])
    
    spark.conf.set("spark.sql.execution.arrow.enabled", "false")
    df.toPandas()
    spark.conf.set("spark.sql.execution.arrow.enabled", "true")
    df.toPandas()
    ```
    
    **Before**
    
    ```
    ...
    py4j.protocol.Py4JJavaError: An error occurred while calling o42.collectAsArrowToPython.
    ...
    java.lang.UnsupportedOperationException: Unsupported data type: map<string,bigint>
    ```
    
    **After**
    
    ```
    ...
              _1
    0  {u'a': 1}
    
    ... UserWarning: Arrow will not be used in toPandas: Unsupported type in conversion to Arrow: MapType(StringType,LongType,true)
    ...
              _1
    0  {u'a': 1}
    ```
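
    The fallback itself boils down to choosing between the Arrow and the non-Arrow conversion based on the schema. The sketch below only illustrates that pattern from user code and is not the actual patch; it reuses the `spark` session and `df` from the snippet above, and both `_schema_supported_by_arrow` and its MapType-only check are simplifying assumptions (the real change lives inside `toPandas` itself):

    ```python
    import warnings

    from pyspark.sql.types import MapType

    def _schema_supported_by_arrow(schema):
        # Simplified, hypothetical check: only flags MapType, which is the
        # unsupported type used in the example above.
        return not any(isinstance(f.dataType, MapType) for f in schema.fields)

    def to_pandas_with_fallback(df):
        if _schema_supported_by_arrow(df.schema):
            return df.toPandas()  # Arrow path, assuming the conf is enabled
        warnings.warn(
            "Arrow will not be used in toPandas: unsupported schema", UserWarning)
        # Fall back by disabling Arrow just for this conversion. Toggling a
        # session conf like this is illustrative only and not thread-safe.
        spark.conf.set("spark.sql.execution.arrow.enabled", "false")
        try:
            return df.toPandas()
        finally:
            spark.conf.set("spark.sql.execution.arrow.enabled", "true")
    ```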
    
    Note that `createDataFrame` already falls back in the same way, so the conversion below at least works even though the optimization is effectively disabled:
    
    ```python
    df = spark.createDataFrame([[{'a': 1}]])
    spark.conf.set("spark.sql.execution.arrow.enabled", "false")
    pdf = df.toPandas()
    spark.createDataFrame(pdf).show()
    spark.conf.set("spark.sql.execution.arrow.enabled", "true")
    spark.createDataFrame(pdf).show()
    ```
    
    ```
    ...
    ... UserWarning: Arrow will not be used in createDataFrame: Error inferring Arrow type ...
    +--------+
    |      _1|
    +--------+
    |[a -> 1]|
    +--------+
    ```
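
    Since the fallback only surfaces as a `UserWarning`, a caller who wants to know which path was actually taken can capture the warning. This is just a usage sketch reusing `spark` and `df` from the snippets above:

    ```python
    import warnings

    spark.conf.set("spark.sql.execution.arrow.enabled", "true")
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        pdf = df.toPandas()
    # If the schema was not Arrow-friendly, the fallback warning is recorded here.
    fell_back = any("Arrow will not be used" in str(w.message) for w in caught)
    print("Fell back to non-Arrow conversion:", fell_back)
    ```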
    
    
    ## How was this patch tested?
    
    Manually tested and unit tests were added in `python/pyspark/sql/tests.py`.
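
    For reference, a regression test in the spirit of the ones added could look roughly like the sketch below; this is not the exact test from `python/pyspark/sql/tests.py`, and it assumes a running SparkSession bound to `spark`:

    ```python
    import unittest
    import warnings

    class ArrowFallbackTest(unittest.TestCase):
        def test_topandas_falls_back_for_map_type(self):
            # MapType columns are not supported by the Arrow path, so toPandas
            # should warn about the fallback and still return correct data.
            spark.conf.set("spark.sql.execution.arrow.enabled", "true")
            df = spark.createDataFrame([[{'a': 1}]])
            with warnings.catch_warnings(record=True) as caught:
                warnings.simplefilter("always")
                pdf = df.toPandas()
            self.assertTrue(any(
                "Arrow will not be used" in str(w.message) for w in caught))
            self.assertEqual(pdf.iloc[0]['_1'], {'a': 1})
    ```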


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark pandas_conversion_cleanup

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20567.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20567
    
----
commit d87547c05c0ab874dfce8e6ddca4ee454926b664
Author: hyukjinkwon <gurwls223@...>
Date:   2018-02-09T03:40:41Z

    toPandas conversion cleanup

----

