GitHub user patrick-nicholson commented on the issue:
https://github.com/apache/spark/pull/17926
> It seems adding a functionality and not a trivial fix. I think we need a
> JIRA.
It's up to you. All I'm doing is passing a keyword argument from one
preexisting public method to another. I don't view that as adding
functionality, but I am not the arbiter of such things.
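For concreteness, here is a minimal sketch of the call site this would enable (the keyword name `numSlices` is an assumption, mirroring `SparkContext.parallelize`; treat this as an illustration rather than the final API):
```python
# Hypothetical usage once the keyword is forwarded; numSlices is assumed
# to carry the same meaning as in SparkContext.parallelize.
df = spark.createDataFrame(pandas_df, numSlices=5)
```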
> I think this is a rather niche case and we can workaround by
> parallelizing outside.
It has been a rather common case for me, since I often work with
`pandas.DataFrame`s of millions of rows and many columns of mixed types (where
any numeric columns are implicitly `numpy` types rather than base Python
types). It can be worked around outside `createDataFrame` by manually
performing the steps it takes internally:
```python
df = spark.createDataFrame(
    spark.sparkContext.parallelize(
        [r.tolist() for r in pandas_df.to_records(index=False)],
        numSlices=5),
    schema=[str(_) for _ in pandas_df.columns])
```
Again, I don't see the proposed change as adding any functionality, just
exposing machinery already in place for distributing Python data to an `RDD` in
a consistent way.
> Also, this looks only applying when the data is not RDD. I think this is
> confusing if a user sets this and this option is not working in some cases
> unless the user reads the documentation.
Given that `RDD` and local data are necessarily different and that
`createDataFrame` already has separate code paths for `RDD` and local Python
data, I don't know how this can be avoided.
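To illustrate, here is a simplified sketch of that dispatch (not Spark's actual source; the structure is abbreviated for illustration):
```python
from pyspark.rdd import RDD

# Simplified sketch of the two existing code paths (illustrative only):
def createDataFrame(self, data, schema=None, numSlices=None):
    if isinstance(data, RDD):
        # Already distributed: its partitioning is fixed, so numSlices
        # cannot apply without an implicit repartition the user didn't
        # ask for.
        rdd = data
    else:
        # Local Python data: this is where slicing happens, so the
        # keyword can simply be forwarded to parallelize.
        rdd = self._sc.parallelize(data, numSlices)
    ...
```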