GitHub user viirya opened a pull request:

    https://github.com/apache/spark/pull/19319

    [SPARK-21766][PySpark][SQL] DataFrame toPandas() raises ValueError with 
nullable int columns

    ## What changes were proposed in this pull request?
    
    When calling `DataFrame.toPandas()` (without Arrow enabled), if there is an 
`IntegralType` column (`IntegerType`, `ShortType`, `ByteType`) that has null 
values, the following exception is thrown:
    
        ValueError: Cannot convert non-finite values (NA or inf) to integer
    
    This is because the null values are first converted to float NaN while the 
Pandas DataFrame is constructed in `from_records`, and the subsequent attempt 
to convert the column back to an integer type then fails.
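    The failure mode can be reproduced with plain Pandas (a minimal sketch of 
the mechanism described above, not PySpark code itself):
    
        import pandas as pd
        
        # Rows mimicking a Spark DataFrame with a nullable integer column.
        records = [(1,), (None,), (3,)]
        
        pdf = pd.DataFrame.from_records(records, columns=["x"])
        print(pdf["x"].dtype)  # float64 -- the None became NaN
        
        # Forcing the column back to the Spark-side integer type fails:
        try:
            pdf["x"] = pdf["x"].astype("int32")
        except ValueError as e:
            print(e)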
    
    The fix checks whether the conversion would cause such a failure; if so, we 
skip the conversion and keep the type inferred by Pandas.
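    In outline, the guard works like the sketch below (an illustrative helper; 
the name `_corrected_dtype` and its signature are not taken from the patch):
    
        import numpy as np
        import pandas as pd
        
        def _corrected_dtype(pandas_col, spark_int_dtype):
            """Return the Spark-side integer dtype only when the cast is safe.
        
            If the column picked up NaN (from nulls) during construction,
            casting back to an integer dtype would raise ValueError, so we
            keep the Pandas-inferred (float) type instead.
            """
            if pandas_col.isnull().any():
                return pandas_col.dtype   # nulls present: keep inferred type
            return spark_int_dtype        # no nulls: restore the integer type
        
        pdf = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [1.0, 2.0, 3.0]})
        print(_corrected_dtype(pdf["a"], np.int32))  # keeps float64
        print(_corrected_dtype(pdf["b"], np.int32))  # restores int32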
    
    ## How was this patch tested?
    
    Added pyspark test.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/viirya/spark-1 SPARK-21766

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19319.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19319
    
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]