[
https://issues.apache.org/jira/browse/SPARK-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14502389#comment-14502389
]
Reynold Xin commented on SPARK-6968:
------------------------------------
For now you can use this code snippet:
https://github.com/apache/spark/blob/8220d5265f1bbea9dfdaeec4f2d06d7fe24c0bc3/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala#L918
I'm going to close this ticket because its scope is much larger than coalesce.
I'd like to design a better API for manipulating DataFrames as if they were
RDDs, so that users can stay entirely in DataFrame land without calling .rdd
(this would require some assumptions about the return type). I will open a
separate ticket when I get around to that.
> Make manipulating an underlying RDD of a DataFrame easier
> ---------------------------------------------------------
>
> Key: SPARK-6968
> URL: https://issues.apache.org/jira/browse/SPARK-6968
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 1.3.0
> Environment: AWS EMR
> Reporter: John Muller
> Priority: Minor
> Original Estimate: 336h
> Remaining Estimate: 336h
>
> Use case: let's say you want to coalesce the RDD underpinning a DataFrame so
> that you get a certain number of partitions when you go to save it:
> {code:title=RDDsAndDataFrames.scala|borderStyle=solid}
> val sc: SparkContext // An existing SparkContext.
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> val df = sqlContext.load("hdfs://examples/src/main/resources/people.avro",
> "avro")
> val coalescedRowRdd = df.rdd.coalesce(8)
> // Now the tricky part: you have to get the schema of the original DataFrame.
> val originalSchema = df.schema
> val finallyCoalescedDF = sqlContext.createDataFrame(coalescedRowRdd,
> originalSchema)
> {code}
> Basically, it would be nice to have an "attachRDD" method on DataFrames that
> takes an RDD[Row]. As long as the RDD has the same schema, we should be good:
> {code:title=SimplerRDDsAndDataFrames.scala|borderStyle=solid}
> val sc: SparkContext // An existing SparkContext.
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> val df = sqlContext.load("hdfs://examples/src/main/resources/people.avro",
> "avro")
> val finallyCoalescedDF = df.attachRDD(df.rdd.coalesce(8))
> {code}
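
Until a dedicated API exists, the round trip described above (take df.rdd,
transform it, reapply df.schema) can be wrapped in a small helper. The following
is only a sketch under stated assumptions: it targets the Spark 1.3 DataFrame
API, and the names DataFrameRddOps, RichDataFrame, and mapRowRdd are
hypothetical, not part of Spark. It is not the attachRDD method proposed in the
ticket, only an approximation built from calls that already exist.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row}

// Hypothetical helper object; nothing here is provided by Spark itself.
object DataFrameRddOps {

  // Adds a mapRowRdd method to DataFrame through an implicit value class.
  implicit class RichDataFrame(val df: DataFrame) extends AnyVal {

    // Applies f to the underlying RDD[Row] and rebuilds a DataFrame
    // with the original schema. The caller is responsible for making
    // sure f keeps each Row compatible with that schema; this helper
    // does not check it.
    def mapRowRdd(f: RDD[Row] => RDD[Row]): DataFrame =
      df.sqlContext.createDataFrame(f(df.rdd), df.schema)
  }
}
```

With the helper in scope (import DataFrameRddOps._), the coalesce use case
becomes df.mapRowRdd(_.coalesce(8)). Passing a function rather than a prebuilt
RDD keeps the schema handling in one place, at the cost of trusting the caller
to leave the row layout unchanged.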