GitHub user patrick-nicholson commented on the issue:
https://github.com/apache/spark/pull/17926
> It seems adding a functionality and not a trivial fix. I think we need a
> JIRA.
It's up to you. All I'm doing is passing a keyword argument from one
preexisting public method to another. I don't view that as adding
functionality, but I am not the arbiter of such things.
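For concreteness, here is a minimal sketch of the call site this would enable (the keyword name `numSlices` is an assumption, mirroring `SparkContext.parallelize`; treat this as an illustration rather than the final API):
```python
# Hypothetical usage once the keyword is forwarded; numSlices is assumed
# to carry the same meaning as in SparkContext.parallelize.
df = spark.createDataFrame(pandas_df, numSlices=5)
```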
> I think this is a rather niche case and we can workaround by
> parallelizing outside.
It has been a rather common case for me, since I often work with
`pandas.DataFrame`s of millions of rows and many columns of mixed types (where
any numeric columns are implicitly `numpy` types rather than base Python
types). It can be worked around outside `createDataFrame` by manually
performing the steps it takes internally:
```python
df = spark.createDataFrame(
    spark.sparkContext.parallelize(
        [r.tolist() for r in pandas_df.to_records(index=False)],
        numSlices=5),
    schema=[str(_) for _ in pandas_df.columns])
```
Again, I don't see the proposed change as adding any functionality, just
exposing machinery already in place for distributing Python data to an `RDD` in
a consistent way.
> Also, this looks only applying when the data is not RDD. I think this is
> confusing if a user sets this and this option is not working in some cases
> unless the user reads the documentation.
Given that `RDD` and local data are necessarily different and that
`createDataFrame` already has separate code paths for `RDD` and local Python
data, I don't know how this can be avoided.
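To illustrate, here is a simplified sketch of that dispatch (not Spark's actual source; the structure is abbreviated for illustration):
```python
from pyspark.rdd import RDD

# Simplified sketch of the two existing code paths (illustrative only):
def createDataFrame(self, data, schema=None, numSlices=None):
    if isinstance(data, RDD):
        # Already distributed: its partitioning is fixed, so numSlices
        # cannot apply without an implicit repartition the user didn't
        # ask for.
        rdd = data
    else:
        # Local Python data: this is where slicing happens, so the
        # keyword can simply be forwarded to parallelize.
        rdd = self._sc.parallelize(data, numSlices)
    ...
```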