[ https://issues.apache.org/jira/browse/SPARK-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14502386#comment-14502386 ]
Patrick Wendell edited comment on SPARK-6968 at 4/20/15 5:56 AM:
-----------------------------------------------------------------

So coalesce() is added to DataFrame directly in 1.4. In cases where you want to drop to the RDD API and go back, I think you do need to pass the original schema. It might be nice to have a constructor for a DataFrame from an RDD[Row] that can infer the schema (/cc [~marmbrus]).

was (Author: pwendell):
So coalesce() is added to the DataFrame directly in 1.4. And for other cases where you drop down to an RDD[Row], I think you can just call "toDF()" on that RDD to get back a DataFrame, provided you have sqlContext.implicits._ imported.

> Make maniuplating an underlying RDD of a DataFrame easier
> ---------------------------------------------------------
>
>                 Key: SPARK-6968
>                 URL: https://issues.apache.org/jira/browse/SPARK-6968
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.3.0
>         Environment: AWS EMR
>            Reporter: John Muller
>            Priority: Minor
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Use case: let's say you want to coalesce the RDD underpinning a DataFrame so
> that you get a certain number of partitions when you go to save it:
> {code:title=RDDsAndDataFrames.scala|borderStyle=solid}
> val sc: SparkContext // An existing SparkContext.
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> val df = sqlContext.load("hdfs://examples/src/main/resources/people.avro", "avro")
> val coalescedRowRdd = df.rdd.coalesce(8)
> // Now the tricky part: you have to get the schema of the original DataFrame
> val originalSchema = df.schema
> val finallyCoalescedDF = sqlContext.createDataFrame(coalescedRowRdd, originalSchema)
> {code}
> Basically, it would be nice to have an "attachRDD" method on DataFrame that
> takes an RDD[Row]; as long as the rows have the same schema, we should be good:
> {code:title=SimplierRDDsAndDataFrames.scala|borderStyle=solid}
> val sc: SparkContext // An existing SparkContext.
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> val df = sqlContext.load("hdfs://examples/src/main/resources/people.avro", "avro")
> val finallyCoalescedDF = df.attachRDD(df.rdd.coalesce(8))
> {code}
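
Until something like that exists, the round trip described in the comment (drop to the RDD, transform it, pass the original schema back) could be wrapped on the user side. The following is only a sketch, not part of Spark: the object name DataFrameRddSyntax and the class AttachRddOps are made up for illustration, attachRDD is just the name proposed in this issue, and it assumes df.sqlContext is accessible on the DataFrame.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row}

object DataFrameRddSyntax {
  // Hypothetical extension: `attachRDD` is the name proposed in this issue,
  // not a method that exists on DataFrame.
  implicit class AttachRddOps(val df: DataFrame) extends AnyVal {
    // Rebuild a DataFrame from transformed rows, reusing the original schema.
    // Nothing here verifies that the rows still match df.schema; that is the
    // caller's responsibility.
    def attachRDD(rows: RDD[Row]): DataFrame =
      df.sqlContext.createDataFrame(rows, df.schema)
  }
}

// Usage, given an existing DataFrame `df`:
//   import DataFrameRddSyntax._
//   val coalesced = df.attachRDD(df.rdd.coalesce(8))
```

This only suits transformations that keep the row layout unchanged (coalesce, repartition, filter and the like); anything that changes the columns would need a new schema rather than the original one.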