[ 
https://issues.apache.org/jira/browse/SPARK-17020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15417204#comment-15417204
 ] 

Roi Reshef edited comment on SPARK-17020 at 8/11/16 1:13 PM:
-------------------------------------------------------------

[~srowen] I have two DataFrames that are generated by the spark-csv reader.
I then pass them through several transformations and join them together.
After that I call either .rdd or .flatMap to get an RDD out of the joined
DataFrame.

Throughout the whole process I've monitored the distribution of the DataFrames.
It remains good up until the point where .rdd is called.
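For reference, a minimal sketch of the pipeline described above, assuming Spark 1.6 with the spark-csv package; the input paths, the join key, and the transformations are illustrative placeholders, not details taken from the ticket:

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Minimal repro sketch (Spark 1.6 + spark-csv); paths and the join key are hypothetical.
val sc = new SparkContext(new SparkConf().setAppName("SPARK-17020-sketch"))
val sqlContext = new SQLContext(sc)

val left = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("hdfs:///data/left.csv")      // hypothetical path

val right = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("hdfs:///data/right.csv")     // hypothetical path

// Stand-ins for the reporter's transformations, followed by the join.
val joined = left.filter(left("id").isNotNull)
  .join(right, left("id") === right("id"))

// Caching at this point shows a healthy partition distribution (dataframe_cache.PNG)...
joined.cache().count()

// ...whereas materializing the RDD reportedly yields a skewed layout (rdd_cache.PNG).
val rdd = joined.rdd
rdd.cache().count()
{code}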


> Materialization of RDD via DataFrame.rdd forces a poor re-distribution of data
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-17020
>                 URL: https://issues.apache.org/jira/browse/SPARK-17020
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 1.6.1, 1.6.2, 2.0.0
>            Reporter: Roi Reshef
>            Priority: Critical
>         Attachments: dataframe_cache.PNG, rdd_cache.PNG
>
>
> Calling DataFrame's lazy val .rdd results in a new RDD with a poor 
> distribution of partitions across the cluster. Moreover, any attempt to 
> repartition this RDD further will fail.
> Attached are screenshots of the original DataFrame on cache and the 
> resulting RDD on cache.
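
A short sketch of the reported failure mode, continuing from the {{joined}} DataFrame in the sketch above; the target partition count is an arbitrary illustration, not a value from the ticket:

{code:scala}
// Continuing from the `joined` DataFrame in the earlier sketch (illustrative only).
val rdd = joined.rdd            // lazy val; per the report, the materialized RDD is poorly distributed
rdd.cache().count()             // corresponds to the rdd_cache.PNG attachment

// The report states that repartitioning the materialized RDD also fails;
// the partition count here is arbitrary.
val rebalanced = rdd.repartition(200)
rebalanced.cache().count()
{code}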


