GitHub user logannc opened a pull request:
https://github.com/apache/spark/pull/18945
Add option to convert nullable int columns to float columns in toPandas to
prevent needless Exceptions during routine use.
Add the `strict=True` kwarg to DataFrame.toPandas to allow for a non-strict
interpretation of the schema of a dataframe. This is currently limited to
allowing a nullable int column to be interpreted as a float column (because
a float column is the only way Pandas can represent an int column that
contains nulls, and the conversion actually crashes without this).
I consider this small change to be a massive quality of life improvement
for DataFrames with lots of nullable int columns, which would otherwise need a
litany of `df.withColumn(name, F.col(name).cast(DoubleType()))`, etc, just to
view them easily or interact with them in-memory.
**Possible Objections**
* I foresee concerns with the name of the kwarg, for which I am open to
suggestions.
* I also foresee possible objections due to the potential for needless
conversion of nullable int columns to floats when there are actually no null
values. I would counter those objections by noting that it only occurs when
strict=False, which is not the default, and can be avoided on a per-column
basis by setting the `nullable` property of the schema to False.
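
To make the rule above concrete, here is a minimal sketch of the column selection it implies. It is not the code in this pull request: `Field`, `INTEGRAL_TYPES`, and `float_overrides` are hypothetical names used only for illustration, and the set of type names treated as "int" is an assumption.

```python
from typing import Dict, List, NamedTuple


class Field(NamedTuple):
    """Hypothetical stand-in for one column of a dataframe schema."""
    name: str
    data_type: str   # e.g. "int", "long", "string"
    nullable: bool


# Assumption: these type names count as "int columns" for this sketch.
INTEGRAL_TYPES = {"byte", "short", "int", "long"}


def float_overrides(fields: List[Field], strict: bool = True) -> Dict[str, str]:
    """Return {column name: "float64"} for the columns that would be widened."""
    if strict:
        # Default behaviour: follow the schema exactly, convert nothing.
        return {}
    return {
        f.name: "float64"
        for f in fields
        if f.nullable and f.data_type in INTEGRAL_TYPES
    }


schema = [
    Field("id", "long", nullable=False),    # opted out via nullable=False
    Field("age", "int", nullable=True),     # the only column that is widened
    Field("name", "string", nullable=True), # not an int column, left alone
]

print(float_overrides(schema))                # strict=True: nothing changes
print(float_overrides(schema, strict=False))  # only the nullable int column
```

With `strict=False`, only `age` is selected; `id` stays an int column because its schema marks it as non-nullable.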
**Alternatives**
* Rename the kwarg to be specific to the current change, e.g.,
`nullable_int_to_float` (or some similar name) instead of `strict`.
* Fix Pandas to allow nullable int columns. (Very difficult, per Wes
McKinney, due to lack of NumPy support.
https://stackoverflow.com/questions/11548005/numpy-or-pandas-keeping-array-type-as-integer-while-having-a-nan-value)
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/logannc/spark nullable_int_pandas
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/18945.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #18945
----
commit bceeefca77dd3414e4ec97ad3570043ec3ce3059
Author: Logan Collins <[email protected]>
Date: 2017-08-15T01:30:08Z
Add option to convert nullable int columns to float columns in toPandas to
prevent needless crashes.
----