[jira] [Comment Edited] (SPARK-6968) Make manipulating an underlying RDD of a DataFrame easier

Patrick Wendell (JIRA) Sun, 19 Apr 2015 22:57:36 -0700

    [ https://issues.apache.org/jira/browse/SPARK-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14502386#comment-14502386 ]


Patrick Wendell edited comment on SPARK-6968 at 4/20/15 5:56 AM:
-----------------------------------------------------------------

So coalesce() is added to the DataFrame directly in 1.4.

In cases where you want to drop to the RDD API and go back, I think you do need 
to pass the original schema. It might be nice to have a constructor for a 
DataFrame from an RDD[Row] that can infer the schema (/cc [~marmbrus]).


was (Author: pwendell):
So coalesce() is added to the DataFrame directly in 1.4. And for other cases 
where you drop down to an RDD[Row], I think you can just call "toDF()" on that 
RDD to get back a DataFrame, provided you have sqlContext.implicits._ imported.

> Make manipulating an underlying RDD of a DataFrame easier
> ---------------------------------------------------------
>
>                 Key: SPARK-6968
>                 URL: https://issues.apache.org/jira/browse/SPARK-6968
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.3.0
>         Environment: AWS EMR
>            Reporter: John Muller
>            Priority: Minor
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Use case:  let's say you want to coalesce the RDD underpinning a DataFrame so 
> that you get a certain number of partitions when you go to save it:
> {code:title=RDDsAndDataFrames.scala|borderStyle=solid}
> val sc: SparkContext // An existing SparkContext.
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> val df = sqlContext.load("hdfs://examples/src/main/resources/people.avro", "avro")
> val coalescedRowRdd = df.rdd.coalesce(8)
> // Now the tricky part, you have to get the schema of the original dataframe:
> val originalSchema = df.schema
> val finallyCoalescedDF = sqlContext.createDataFrame(coalescedRowRdd, originalSchema)
> {code}
> Basically, it would be nice to have an "attachRDD" method on DataFrame that 
> takes an RDD[Row]. As long as the RDD has the same schema, we should be good:
> {code:title=SimplerRDDsAndDataFrames.scala|borderStyle=solid}
> val sc: SparkContext // An existing SparkContext.
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> val df = sqlContext.load("hdfs://examples/src/main/resources/people.avro", "avro")
> // attachRDD is the proposed method; it does not exist on DataFrame today.
> val finallyCoalescedDF = df.attachRDD(df.rdd.coalesce(8))
> {code}
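
For reference, the two routes discussed in this thread could be sketched roughly
as below: calling coalesce() on the DataFrame itself (1.4 and later), and
dropping to the RDD API and rebuilding the DataFrame with the original schema.
This is only a sketch under assumptions: the object name CoalesceSketch, the
helpers coalesceDirect, coalesceViaRdd and sampleDataFrame, the in-memory sample
rows, and the local[2] master are all made up for illustration and are not part
of the issue or of Spark.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical helper object, for illustration only.
object CoalesceSketch {

  // Route 1 (Spark 1.4+): coalesce directly on the DataFrame.
  def coalesceDirect(df: DataFrame, numPartitions: Int): DataFrame =
    df.coalesce(numPartitions)

  // Route 2: drop to the RDD API, coalesce, then rebuild the DataFrame.
  // The original schema has to be passed back in explicitly.
  def coalesceViaRdd(sqlContext: SQLContext, df: DataFrame, numPartitions: Int): DataFrame = {
    val rows: RDD[Row] = df.rdd.coalesce(numPartitions)
    sqlContext.createDataFrame(rows, df.schema)
  }

  // Small in-memory DataFrame (assumed sample data) spread over 4 partitions,
  // standing in for the Avro file loaded in the issue description.
  def sampleDataFrame(sqlContext: SQLContext): DataFrame = {
    val schema = StructType(Seq(
      StructField("name", StringType, nullable = false),
      StructField("age", IntegerType, nullable = false)))
    val rows = sqlContext.sparkContext.parallelize(
      Seq(Row("Alice", 30), Row("Bob", 25), Row("Carol", 41), Row("Dan", 19)),
      numSlices = 4)
    sqlContext.createDataFrame(rows, schema)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("CoalesceSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val sqlContext = new SQLContext(sc)
      val df = sampleDataFrame(sqlContext)

      val direct = coalesceDirect(df, 2)
      val viaRdd = coalesceViaRdd(sqlContext, df, 2)

      // Print the partition counts and whether the schema survived the round trip.
      println(direct.rdd.partitions.length)
      println(viaRdd.rdd.partitions.length)
      println(viaRdd.schema == df.schema)
    } finally {
      sc.stop()
    }
  }
}
```

The second route is the pattern from the issue description wrapped in a
function; the only thing the caller has to remember is to reuse df.schema.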



