That's great. I guess I was looking at a somewhat stale master branch...

On Tue, Jan 27, 2015 at 2:19 PM, Reynold Xin <r...@databricks.com> wrote:

> Koert,
>
> As Mark said, I have already refactored the API so that nothing in
> catalyst is exposed (and users won't need it anyway). The data types and Row
> interfaces are both outside the catalyst package, in org.apache.spark.sql.
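> 
> For example, user code can stay entirely within the public package. A minimal
> sketch (the exact class locations here are assumed from the refactoring
> described above, Spark 1.3-style):
> 
>   import org.apache.spark.sql.Row
>   import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}
> 
>   object PublicApiOnly {
>     def main(args: Array[String]): Unit = {
>       // Schema and row construction without importing anything from catalyst.
>       val schema = StructType(Seq(
>         StructField("name", StringType, nullable = false),
>         StructField("age", IntegerType, nullable = false)))
>       val row = Row("koert", 42)
>       // Field access goes through the public Row interface.
>       println(row.getString(0) + " is " + row.getInt(1))
>       println("schema: " + schema.fieldNames.mkString(", "))
>     }
>   }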
>
> On Tue, Jan 27, 2015 at 9:08 AM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> Hey Matei,
>> I think that features such as SchemaRDD, columnar storage, and perhaps also
>> query planning can be re-used by many systems that do analysis on
>> structured data. I can imagine pandas-like systems, but also datalog or
>> scalding-like ones (which we use at Tresata, and which I might rebase on
>> SchemaRDD at some point). SchemaRDD should become the interface for all of
>> these, and the columnar storage abstractions should be shared among them.
>>
>> Currently the SQL tie-in goes way beyond just the (perhaps unfortunate)
>> naming convention. For example, a core part of the SchemaRDD abstraction is
>> Row, which is org.apache.spark.sql.catalyst.expressions.Row, forcing anyone
>> who wants to build on top of SchemaRDD to dig into catalyst, a SQL parser
>> (if I understand it correctly; I have not used catalyst, but it looks
>> neat). I should not need to include a SQL parser just to use structured
>> data in, say, a pandas-like framework.
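>>
>> Concretely, what such a framework needs is roughly this (a hypothetical
>> sketch; only the public Row interface and the core RDD API are touched):
>>
>>   import org.apache.spark.rdd.RDD
>>   import org.apache.spark.sql.Row
>>
>>   // Pull one column out of a structured dataset as plain strings, without
>>   // depending on catalyst or any SQL parsing machinery.
>>   def columnAsStrings(rows: RDD[Row], i: Int): RDD[String] =
>>     rows.map(row => Option(row.get(i)).map(_.toString).getOrElse(""))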
>>
>> best, koert
>>
>>
>> On Mon, Jan 26, 2015 at 8:31 PM, Matei Zaharia <matei.zaha...@gmail.com>
>> wrote:
>>
>>> While it might be possible to move this concept to Spark Core long-term,
>>> supporting structured data efficiently does require quite a bit of the
>>> infrastructure in Spark SQL, such as query planning and columnar storage.
>>> The intent of Spark SQL though is to be more than a SQL server -- it's
>>> meant to be a library for manipulating structured data. Since this is
>>> possible to build over the core API, it's pretty natural to organize it
>>> that way, just as Spark Streaming is a library.
>>>
>>> Matei
>>>
>>> > On Jan 26, 2015, at 4:26 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>> >
>>> > "The context is that SchemaRDD is becoming a common data format used
>>> for
>>> > bringing data into Spark from external systems, and used for various
>>> > components of Spark, e.g. MLlib's new pipeline API."
>>> >
>>> > I agree. To me this also implies it belongs in Spark core, not SQL.
>>> >
>>> > On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak <
>>> > michaelma...@yahoo.com.invalid> wrote:
>>> >
>>> >> And on the off chance that anyone hasn't seen it yet, the Jan. 13 Bay Area
>>> >> Spark Meetup YouTube video contained a wealth of background information on
>>> >> this idea (mostly from Patrick and Reynold :-).
>>> >>
>>> >> https://www.youtube.com/watch?v=YWppYPWznSQ
>>> >>
>>> >> ________________________________
>>> >> From: Patrick Wendell <pwend...@gmail.com>
>>> >> To: Reynold Xin <r...@databricks.com>
>>> >> Cc: "dev@spark.apache.org" <dev@spark.apache.org>
>>> >> Sent: Monday, January 26, 2015 4:01 PM
>>> >> Subject: Re: renaming SchemaRDD -> DataFrame
>>> >>
>>> >>
>>> >> One thing potentially not clear from this e-mail: there will be a 1:1
>>> >> correspondence, so you can get an RDD to/from a DataFrame.
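>>> >>
>>> >> That round trip would look roughly like this (a sketch; the method names
>>> >> are assumed from the 1.3-style API):
>>> >>
>>> >>   import org.apache.spark.rdd.RDD
>>> >>   import org.apache.spark.sql.{Row, SQLContext}
>>> >>   import org.apache.spark.sql.types.{StructType, StructField, StringType}
>>> >>
>>> >>   def roundTrip(sqlContext: SQLContext, rows: RDD[Row]): RDD[Row] = {
>>> >>     val schema = StructType(Seq(StructField("name", StringType, nullable = true)))
>>> >>     val df = sqlContext.createDataFrame(rows, schema)  // RDD -> DataFrame
>>> >>     df.rdd                                             // DataFrame -> RDD
>>> >>   }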
>>> >>
>>> >>
>>> >> On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin <r...@databricks.com> wrote:
>>> >>> Hi,
>>> >>>
>>> >>> We are considering renaming SchemaRDD -> DataFrame in 1.3, and wanted to
>>> >>> get the community's opinion.
>>> >>>
>>> >>> The context is that SchemaRDD is becoming a common data format used for
>>> >>> bringing data into Spark from external systems, and is used by various
>>> >>> components of Spark, e.g. MLlib's new pipeline API. We also expect more
>>> >>> and more users to be programming directly against the SchemaRDD API
>>> >>> rather than the core RDD API. SchemaRDD, through its less commonly used
>>> >>> DSL originally designed for writing test cases, has always had a
>>> >>> DataFrame-like API. In 1.3, we are redesigning that API to make it usable
>>> >>> for end users.
>>> >>>
>>> >>>
>>> >>> There are two motivations for the renaming:
>>> >>>
>>> >>> 1. DataFrame seems to be a more self-evident name than SchemaRDD.
>>> >>>
>>> >>> 2. SchemaRDD/DataFrame is actually not going to be an RDD anymore (even
>>> >>> though it would contain some RDD functions like map, flatMap, etc.), and
>>> >>> calling it Schema*RDD* while it is not an RDD is highly confusing. Instead,
>>> >>> DataFrame.rdd will return the underlying RDD for all RDD methods.
>>> >>>
>>> >>>
>>> >>> My understanding is that very few users program directly against the
>>> >>> SchemaRDD API at the moment, because it is not well documented. However,
>>> >>> to maintain backward compatibility, we can create a type alias for
>>> >>> DataFrame that is still named SchemaRDD. This will maintain source
>>> >>> compatibility for Scala. That said, we will have to update all existing
>>> >>> materials to use DataFrame rather than SchemaRDD.
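>>> >>>
>>> >>> The alias itself could be as simple as something like this (illustrative
>>> >>> sketch only; the exact deprecation message and location may differ):
>>> >>>
>>> >>>   package object sql {
>>> >>>     @deprecated("use DataFrame", "1.3.0")
>>> >>>     type SchemaRDD = DataFrame
>>> >>>   }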
>>> >>
>>> >> ---------------------------------------------------------------------
>>> >> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> >> For additional commands, e-mail: dev-h...@spark.apache.org
>>> >>
>>> >>
>>>
>>>
>>
>
