[jira] [Commented] (SPARK-17020) Materialization of RDD via DataFrame.rdd forces a poor re-distribution of data

Sean Owen (JIRA) Thu, 11 Aug 2016 06:46:45 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-17020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15417268#comment-15417268
 ]


Sean Owen commented on SPARK-17020:
-----------------------------------

Yeah, once the DataFrame is cached and its partitions are established, I'd 
certainly expect Spark to do the sensible thing and use that locality, so 
you'd find the RDD's partitions have the same, well-distributed locality.

What's the code path where you cache the DataFrame? I only see the RDD cached 
here.
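
To make the question concrete, here is a minimal PySpark sketch of the code path I mean: cache and materialize the DataFrame first, then derive the RDD from it and cache that too. The dataset (`spark.range`), the partition count, the app name and the helpers `records_per_partition` and `skew_ratio` are all hypothetical, not taken from the reporter's job. The helpers only count records per partition. They say nothing about which executor holds each partition, which is what the Storage tab screenshots show.

```python
def records_per_partition(rdd):
    """Return a list with the number of records in each partition of `rdd`."""
    # mapPartitions runs once per partition; each task emits a single count.
    return rdd.mapPartitions(lambda rows: [sum(1 for _ in rows)]).collect()


def skew_ratio(counts):
    """Largest partition size divided by the mean size (1.0 means perfectly even)."""
    if not counts or sum(counts) == 0:
        return 0.0
    return max(counts) / (sum(counts) / len(counts))


def main():
    # Imported here so the helpers above can be used without Spark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[4]")              # assumption: local test session
        .appName("spark-17020-sketch")   # hypothetical app name
        .getOrCreate()
    )
    df = spark.range(0, 100000).repartition(8)  # stand-in for the real data

    # Code path in question: cache and materialize the DataFrame first...
    df.cache()
    df.count()

    # ...then derive the RDD from it, cache it, and materialize it as well.
    rdd = df.rdd.cache()
    rdd.count()

    counts = records_per_partition(rdd)
    print("partitions:", len(counts), "skew:", skew_ratio(counts))
    spark.stop()
```

Calling `main()` on a machine with PySpark installed runs the whole path. Comparing the result with a run that skips `df.cache()` would show whether caching the DataFrame first makes any difference.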

> Materialization of RDD via DataFrame.rdd forces a poor re-distribution of data
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-17020
>                 URL: https://issues.apache.org/jira/browse/SPARK-17020
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 1.6.1, 1.6.2, 2.0.0
>            Reporter: Roi Reshef
>         Attachments: dataframe_cache.PNG, rdd_cache.PNG
>
>
> Calling DataFrame's lazy val .rdd results in a new RDD whose partitions are 
> poorly distributed across the cluster. Moreover, any attempt to 
> repartition this RDD further will fail.
> Attached are screenshots of the original DataFrame in the cache and the 
> resulting RDD in the cache.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

