[ https://issues.apache.org/jira/browse/SPARK-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Patrick Wendell updated SPARK-6968:
-----------------------------------
    Priority: Critical  (was: Minor)

> Make manipulating an underlying RDD of a DataFrame easier
> ---------------------------------------------------------
>
>                 Key: SPARK-6968
>                 URL: https://issues.apache.org/jira/browse/SPARK-6968
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.3.0
>         Environment: AWS EMR
>            Reporter: John Muller
>            Priority: Critical
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Use case: say you want to coalesce the RDD underpinning a DataFrame so
> that you get a specific number of partitions when you save it:
> {code:title=RDDsAndDataFrames.scala|borderStyle=solid}
> val sc: SparkContext // An existing SparkContext.
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> val df = sqlContext.load("hdfs://examples/src/main/resources/people.avro", "avro")
> val coalescedRowRdd = df.rdd.coalesce(8)
> // Now the tricky part: you have to get the schema of the original DataFrame
> // and rebuild a DataFrame from the coalesced RDD[Row].
> val originalSchema = df.schema
> val finallyCoalescedDF = sqlContext.createDataFrame(coalescedRowRdd, originalSchema)
> {code}
> It would be nice to have an "attachRDD" method on DataFrame that takes an
> RDD[Row]. As long as the RDD has the same schema as the DataFrame, we
> should be good:
> {code:title=SimplierRDDsAndDataFrames.scala|borderStyle=solid}
> val sc: SparkContext // An existing SparkContext.
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> val df = sqlContext.load("hdfs://examples/src/main/resources/people.avro", "avro")
> val finallyCoalescedDF = df.attachRDD(df.rdd.coalesce(8))
> {code}
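
For illustration only, the requested behavior can be approximated in user code against the 1.3.0 API with an implicit wrapper. This is a minimal sketch, not part of Spark: the names DataFrameRddOps, RichDataFrame and attachRDD are hypothetical. It assumes the caller guarantees that the rows match the DataFrame's schema, since nothing here validates them.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row}

// Hypothetical helper, not a Spark API: adds an attachRDD method to DataFrame.
object DataFrameRddOps {
  implicit class RichDataFrame(df: DataFrame) {
    // Rebuilds a DataFrame from `rows`, reusing the schema and SQLContext
    // of the wrapped DataFrame. The rows are assumed to conform to that
    // schema; no check is performed here.
    def attachRDD(rows: RDD[Row]): DataFrame =
      df.sqlContext.createDataFrame(rows, df.schema)
  }
}
```

With `import DataFrameRddOps._` in scope, the second snippet in the issue would read `df.attachRDD(df.rdd.coalesce(8))`. It is the same createDataFrame call as in the first snippet, just hidden behind the wrapper.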