GitHub user viirya opened a pull request:
https://github.com/apache/spark/pull/19319
[SPARK-21766][PySpark][SQL] DataFrame toPandas() raises ValueError with
nullable int columns
## What changes were proposed in this pull request?
When calling `DataFrame.toPandas()` (without Arrow enabled), if there is an
`IntegralType` column (`IntegerType`, `ShortType`, `ByteType`) that has null
values, the following exception is thrown:

    ValueError: Cannot convert non-finite values (NA or inf) to integer
This is because the null values are first converted to float NaN during the
construction of the Pandas DataFrame in `from_records`, and the subsequent
attempt to cast the column back to an integer type then fails.
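The failure mode can be reproduced with pandas alone; this is a minimal sketch (column name `x` and the sample records are illustrative, not taken from the PR):

```python
import pandas as pd

# Records as they might come back from collect(): the second row's
# integer value is null.
records = [(1,), (None,)]
pdf = pd.DataFrame.from_records(records, columns=["x"])

# Pandas has no NaN representation for plain int dtypes, so the null
# forces the column to float64.
print(pdf["x"].dtype)  # float64

# Casting back to the narrower integer type Spark's schema mandates
# fails on the NaN.
try:
    pdf["x"].astype("int32")
except ValueError as e:
    print(e)  # Cannot convert non-finite values (NA or inf) to integer
```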
The fix checks whether the dtype conversion would fail for a Pandas DataFrame
column; if so, we skip the conversion and keep the type inferred by Pandas.
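A rough sketch of that check, as a hypothetical helper (the function name, the type-name keys, and the Spark-to-NumPy mapping shown here are assumptions for illustration, not the PR's actual code):

```python
import numpy as np
import pandas as pd

def corrected_dtype_or_inferred(spark_type_name, series):
    """Return the series cast to the narrower NumPy integer dtype that
    matches the Spark type, unless the column contains nulls, in which
    case the dtype Pandas inferred (float64 with NaN) is kept as-is."""
    # Assumed mapping from Spark IntegralType names to NumPy dtypes.
    spark_to_numpy = {"int": np.int32, "smallint": np.int16, "tinyint": np.int8}
    target = spark_to_numpy.get(spark_type_name)
    if target is None or series.isnull().any():
        # NaN cannot be stored in a plain int column; keep inferred type.
        return series
    return series.astype(target)

# With a null present, the column stays float64 instead of raising.
print(corrected_dtype_or_inferred("int", pd.Series([1.0, np.nan])).dtype)
# Without nulls, the conversion to int32 proceeds as before.
print(corrected_dtype_or_inferred("int", pd.Series([1.0, 2.0])).dtype)
```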
## How was this patch tested?
Added pyspark test.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/viirya/spark-1 SPARK-21766
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/19319.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #19319
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]