[ https://issues.apache.org/jira/browse/SPARK-17020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15417291#comment-15417291 ]

Sean Owen commented on SPARK-17020:
-----------------------------------

I see; I was asking because you show the results of caching a DataFrame above.
My guess is that in one case the DataFrame is computed with the expected number
of partitions, while when you go straight to the RDD it ends up executing one
task for one partition, putting the result in one big block. As to why, I don't
know. You could confirm or rule this out by checking the partition count of the
DataFrame and of the RDD in each case.
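For reference, a minimal sketch of that check (Scala; the input DataFrame here is
made up for illustration and stands in for the cached one in the screenshots,
which isn't shown in this thread):

    // Hypothetical DataFrame standing in for the one from the report.
    val df = sqlContext.range(0L, 1000000L)
    df.cache().count()

    // Partition count behind the DataFrame.
    println(s"DataFrame partitions: ${df.rdd.partitions.length}")

    // Going straight to the RDD and caching it, as described in the issue.
    val rdd = df.rdd
    rdd.cache().count()
    println(s"RDD partitions: ${rdd.partitions.length}")

If the two counts differ (e.g. the RDD reports 1), that would match the
one-task / one-big-block behavior described above.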

> Materialization of RDD via DataFrame.rdd forces a poor re-distribution of data
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-17020
>                 URL: https://issues.apache.org/jira/browse/SPARK-17020
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 1.6.1, 1.6.2, 2.0.0
>            Reporter: Roi Reshef
>         Attachments: dataframe_cache.PNG, rdd_cache.PNG
>
>
> Calling DataFrame's lazy val .rdd results in a new RDD with a poor 
> distribution of partitions across the cluster. Moreover, any attempt to 
> repartition this RDD further will fail.
> Attached are screenshots of the original DataFrame on cache and of the 
> resulting RDD on cache.
