GitHub user logannc opened a pull request:

    https://github.com/apache/spark/pull/18945

    Add option to convert nullable int columns to float columns in toPandas to prevent needless Exceptions during routine use.
    
    Add the `strict=True` kwarg to DataFrame.toPandas to allow for a non-strict 
interpretation of the schema of a dataframe. This is currently limited to 
allowing a nullable int column to be interpreted as a float column, because 
a float column is the only way Pandas can represent an int column that 
contains nulls; without this option, toPandas crashes on such columns.
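
    As a rough illustration of the rule described above (a sketch only, not 
the code in this patch), the per-column dtype choice could look like the 
following. `Field`, `INT_DTYPES`, and `pandas_dtype_for` are hypothetical names 
made up for this example, and the type-name strings are assumed stand-ins for 
the Spark integral types.

```python
from collections import namedtuple

# Hypothetical stand-in for one field of a Spark schema.
Field = namedtuple("Field", ["name", "type_name", "nullable"])

# Assumed mapping from integral Spark type names to NumPy-style dtype names.
INT_DTYPES = {"byte": "int8", "short": "int16", "integer": "int32", "long": "int64"}


def pandas_dtype_for(field, strict=True):
    """Return the dtype name to force on a column, or None to leave it alone."""
    if field.type_name not in INT_DTYPES:
        return None  # non-integer columns are out of scope for this sketch
    if field.nullable and not strict:
        return "float64"  # a float column can hold NaN for missing values
    return INT_DTYPES[field.type_name]
```

    With `strict=True` (the default) the integer dtype is kept, which matches 
the existing behavior; only nullable int columns change when `strict=False`.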
    
    I consider this small change to be a massive quality of life improvement 
for DataFrames with lots of nullable int columns, which would otherwise need a 
litany of `df.withColumn(name, F.col(name).cast(DoubleType()))` calls just to 
view them easily or interact with them in memory.
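
    For reference, the manual workaround can be wrapped in a small helper 
along these lines. This is a sketch under assumptions: the helper name 
`cast_nullable_ints_to_double` is hypothetical, and it relies on 
`dataType.typeName()` returning "byte", "short", "integer", or "long" for the 
integral types.

```python
# Type names assumed to identify Spark's integral column types.
INTEGRAL_TYPE_NAMES = {"byte", "short", "integer", "long"}


def cast_nullable_ints_to_double(df):
    """Return df with every nullable integral column cast to double."""
    for field in df.schema.fields:
        if field.nullable and field.dataType.typeName() in INTEGRAL_TYPE_NAMES:
            # df[name] yields a Column; cast("double") mirrors cast(DoubleType()).
            df = df.withColumn(field.name, df[field.name].cast("double"))
    return df
```

    Usage would be `cast_nullable_ints_to_double(df).toPandas()`, which is the 
boilerplate the new kwarg is meant to remove.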
    
    **Possible Objections**
    * I foresee concerns with the name of the kwarg, for which I am open to 
suggestions.
    * I also foresee possible objections due to the potential for needless 
conversion of nullable int columns to floats when there are actually no null 
values. I would counter those objections by noting that it only occurs when 
strict=False, which is not the default, and can be avoided on a per-column 
basis by setting the `nullable` property of the schema to False. 
    
    **Alternatives**
    * Rename the kwarg to be specific to the current change, e.g., 
`nullable_int_to_float` instead of `strict`, or some similar name.
    * Fix Pandas to allow nullable int columns. (Very difficult, per Wes 
McKinney, due to lack of NumPy support: 
https://stackoverflow.com/questions/11548005/numpy-or-pandas-keeping-array-type-as-integer-while-having-a-nan-value)


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/logannc/spark nullable_int_pandas

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18945.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18945
    
----
commit bceeefca77dd3414e4ec97ad3570043ec3ce3059
Author: Logan Collins <[email protected]>
Date:   2017-08-15T01:30:08Z

    Add option to convert nullable int columns to float columns in toPandas to 
prevent needless crashes.

----


