While it might be possible to move this concept to Spark Core long-term, 
supporting structured data efficiently does require quite a bit of the 
infrastructure in Spark SQL, such as query planning and columnar storage. The 
intent of Spark SQL, though, is to be more than a SQL server -- it's meant to 
be a library for manipulating structured data. Since this is possible to build 
over the core API, it's pretty natural to organize it that way, the same way 
Spark Streaming is a library.
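
To make the "library, not SQL server" point concrete, this is roughly what it 
looks like in 1.2 (a minimal sketch adapted from the programming guide; it 
assumes an existing SparkContext named sc, and the Person data is made up):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    // Implicitly converts an RDD of case classes into a SchemaRDD.
    import sqlContext.createSchemaRDD

    case class Person(name: String, age: Int)
    val people = sc.parallelize(Seq(Person("Alice", 30), Person("Bob", 16)))

    // Structured data handled in-process by Spark SQL's planner -- no external
    // SQL server involved.
    people.registerTempTable("people")
    val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")
    adults.collect().foreach(println)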

Matei

> On Jan 26, 2015, at 4:26 PM, Koert Kuipers <ko...@tresata.com> wrote:
> 
> "The context is that SchemaRDD is becoming a common data format used for
> bringing data into Spark from external systems, and used for various
> components of Spark, e.g. MLlib's new pipeline API."
> 
> I agree. To me, this also implies it belongs in Spark Core, not SQL.
> 
> On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak <
> michaelma...@yahoo.com.invalid> wrote:
> 
>> And on the off chance that anyone hasn't seen it yet, the Jan. 13 Bay Area
>> Spark Meetup YouTube video contains a wealth of background information on
>> this idea (mostly from Patrick and Reynold :-).
>> 
>> https://www.youtube.com/watch?v=YWppYPWznSQ
>> 
>> ________________________________
>> From: Patrick Wendell <pwend...@gmail.com>
>> To: Reynold Xin <r...@databricks.com>
>> Cc: "dev@spark.apache.org" <dev@spark.apache.org>
>> Sent: Monday, January 26, 2015 4:01 PM
>> Subject: Re: renaming SchemaRDD -> DataFrame
>> 
>> 
>> One thing potentially not clear from this e-mail: there will be a 1:1
>> correspondence, so you can convert an RDD to/from a DataFrame.
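>> 
>> Roughly like this (an illustrative sketch only -- the exact 1.3 names, e.g.
>> createDataFrame, are assumptions, not final API; rdd and sqlContext are
>> assumed to already exist):
>> 
>>     val df: DataFrame = sqlContext.createDataFrame(rdd)  // RDD -> DataFrame
>>     val rows: RDD[Row] = df.rdd                          // DataFrame -> underlying RDD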
>> 
>> 
>> On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin <r...@databricks.com> wrote:
>>> Hi,
>>> 
>>> We are considering renaming SchemaRDD -> DataFrame in 1.3, and wanted to
>>> get the community's opinion.
>>> 
>>> The context is that SchemaRDD is becoming a common data format used for
>>> bringing data into Spark from external systems, and used for various
>>> components of Spark, e.g. MLlib's new pipeline API. We also expect more
>>> and more users to be programming directly against the SchemaRDD API rather
>>> than the core RDD API. SchemaRDD, through its less commonly used DSL
>>> originally designed for writing test cases, has always had a data-frame-like
>>> API. In 1.3, we are redesigning that API to make it usable for end users.
>>> 
>>> 
>>> There are two motivations for the renaming:
>>> 
>>> 1. DataFrame seems to be a more self-evident name than SchemaRDD.
>>> 
>>> 2. SchemaRDD/DataFrame is actually not going to be an RDD anymore (even
>>> though it would contain some RDD functions like map, flatMap, etc.), and
>>> calling it Schema*RDD* while it is not an RDD is highly confusing. Instead,
>>> DataFrame.rdd will return the underlying RDD for all RDD methods.
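>>> 
>>> As a sketch of that split (illustrative only; df is some DataFrame):
>>> 
>>>     df.map(row => row.getString(0))  // a few RDD-style methods stay on DataFrame
>>>     df.rdd.zipWithIndex()            // everything else goes through the underlying RDD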
>>> 
>>> 
>>> My understanding is that very few users program directly against the
>>> SchemaRDD API at the moment, because it is not well documented. However,
>>> to maintain backward compatibility, we can create a type alias for
>>> DataFrame that is still named SchemaRDD. This will maintain source
>>> compatibility for Scala. That said, we will have to update all existing
>>> materials to use DataFrame rather than SchemaRDD.
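>>> 
>>> Something along these lines (a sketch; the exact deprecation message and
>>> version string are assumptions):
>>> 
>>>     package object sql {
>>>       @deprecated("use DataFrame", "1.3.0")
>>>       type SchemaRDD = DataFrame
>>>     }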
>> 

