Alright, I have merged the patch ( https://github.com/apache/spark/pull/4173 ), since I don't see any strong opinions against it (as a matter of fact, most were for it). We can still change it if somebody lays out a strong argument.
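For anyone catching up on the thread below: the compatibility shim is essentially a deprecated type alias, so existing Scala code that names SchemaRDD keeps compiling, and no implicit conversion is needed since the alias *is* the same type. A rough sketch of the idea (not the exact code in the patch; assume DataFrame lives in org.apache.spark.sql):

    package org.apache.spark

    package object sql {
      // SchemaRDD is just another name for the same type, so methods that
      // declare SchemaRDD and methods that declare DataFrame are
      // interchangeable at compile time -- no implicit conversion needed.
      @deprecated("use DataFrame", "1.3.0")
      type SchemaRDD = DataFrame
    }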
On Tue, Jan 27, 2015 at 12:25 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:

> The type alias means your methods can specify either type and they will work. It's just another name for the same type. But Scaladocs and such will show DataFrame as the type.
>
> Matei
>
> > On Jan 27, 2015, at 12:10 PM, Dirceu Semighini Filho <dirceu.semigh...@gmail.com> wrote:
> >
> > Reynold,
> > But with the type alias we will have the same problem, right?
> > If the methods don't accept SchemaRDD anymore, we will have to change our code to migrate from SchemaRDD to DataFrame, unless we have an implicit conversion between DataFrame and SchemaRDD.
> >
> > 2015-01-27 17:18 GMT-02:00 Reynold Xin <r...@databricks.com>:
> >
> >> Dirceu,
> >>
> >> That is not possible because one cannot overload return types.
> >>
> >> SQLContext.parquetFile (and many other methods) needs to return some type, and that type cannot be both SchemaRDD and DataFrame.
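> >>
> >> For instance, here is a sketch (hypothetical code, not the actual parquetFile implementation) of why such an overload cannot exist when SchemaRDD and DataFrame are two distinct types:
> >>
> >>   class SQLContext {
> >>     def parquetFile(path: String): SchemaRDD = ???
> >>     // won't compile: an overload may not differ only in its return type
> >>     def parquetFile(path: String): DataFrame = ???
> >>   }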
> >>
> >> In 1.3, we will create a type alias for DataFrame called SchemaRDD to not break source compatibility for Scala.
> >>
> >> On Tue, Jan 27, 2015 at 6:28 AM, Dirceu Semighini Filho <dirceu.semigh...@gmail.com> wrote:
> >>
> >>> Can't SchemaRDD remain the same, but deprecated, and be removed in release 1.5 (+/- 1), for example, with the new code added to DataFrame? That way we wouldn't impact existing code for the next few releases.
> >>>
> >>> 2015-01-27 0:02 GMT-02:00 Kushal Datta <kushal.da...@gmail.com>:
> >>>
> >>>> I want to address the issue that Matei raised about the heavy lifting required for full SQL support. It is amazing that even after 30 years of research there is not a single good open source columnar database like Vertica. There is a column store option in MySQL, but it is not nearly as sophisticated as Vertica or MonetDB. But there is a true need for such a system. I wonder why that is, and it's high time to change it.
> >>>>
> >>>> On Jan 26, 2015 5:47 PM, "Sandy Ryza" <sandy.r...@cloudera.com> wrote:
> >>>>
> >>>>> Both SchemaRDD and DataFrame sound fine to me, though I like the former slightly better because it's more descriptive.
> >>>>>
> >>>>> Even if SchemaRDD needs to rely on Spark SQL under the covers, it would be clearer from a user-facing perspective to at least choose a package name for it that omits "sql".
> >>>>>
> >>>>> I would also be in favor of adding a separate Spark Schema module for Spark SQL to rely on, but I imagine that might be too large a change at this point?
> >>>>>
> >>>>> -Sandy
> >>>>>
> >>>>> On Mon, Jan 26, 2015 at 5:32 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
> >>>>>
> >>>>>> (Actually, when we designed Spark SQL we thought of giving it another name, like Spark Schema, but we decided to stick with SQL since that was the most obvious use case to many users.)
> >>>>>>
> >>>>>> Matei
> >>>>>>
> >>>>>>> On Jan 26, 2015, at 5:31 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
> >>>>>>>
> >>>>>>> While it might be possible to move this concept to Spark Core long-term, supporting structured data efficiently does require quite a bit of the infrastructure in Spark SQL, such as query planning and columnar storage. The intent of Spark SQL, though, is to be more than a SQL server -- it's meant to be a library for manipulating structured data. Since this is possible to build over the core API, it's pretty natural to organize it that way, just as Spark Streaming is a library.
> >>>>>>>
> >>>>>>> Matei
> >>>>>>>
> >>>>>>>> On Jan 26, 2015, at 4:26 PM, Koert Kuipers <ko...@tresata.com> wrote:
> >>>>>>>>
> >>>>>>>> "The context is that SchemaRDD is becoming a common data format used for bringing data into Spark from external systems, and used for various components of Spark, e.g. MLlib's new pipeline API."
> >>>>>>>>
> >>>>>>>> I agree. To me this also implies it belongs in Spark core, not SQL.
> >>>>>>>>
> >>>>>>>> On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak <michaelma...@yahoo.com.invalid> wrote:
> >>>>>>>>
> >>>>>>>>> And on the off chance that anyone hasn't seen it yet, the Jan. 13 Bay Area Spark Meetup YouTube video contained a wealth of background information on this idea (mostly from Patrick and Reynold :-).
> >>>>>>>>>
> >>>>>>>>> https://www.youtube.com/watch?v=YWppYPWznSQ
> >>>>>>>>>
> >>>>>>>>> ________________________________
> >>>>>>>>> From: Patrick Wendell <pwend...@gmail.com>
> >>>>>>>>> To: Reynold Xin <r...@databricks.com>
> >>>>>>>>> Cc: "dev@spark.apache.org" <dev@spark.apache.org>
> >>>>>>>>> Sent: Monday, January 26, 2015 4:01 PM
> >>>>>>>>> Subject: Re: renaming SchemaRDD -> DataFrame
> >>>>>>>>>
> >>>>>>>>> One thing potentially not clear from this e-mail: there will be a 1:1 correspondence where you can get an RDD to/from a DataFrame.
> >>>>>>>>>
> >>>>>>>>> On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin <r...@databricks.com> wrote:
> >>>>>>>>>> Hi,
> >>>>>>>>>>
> >>>>>>>>>> We are considering renaming SchemaRDD -> DataFrame in 1.3, and wanted to get the community's opinion.
> >>>>>>>>>>
> >>>>>>>>>> The context is that SchemaRDD is becoming a common data format used for bringing data into Spark from external systems, and used for various components of Spark, e.g. MLlib's new pipeline API. We also expect more and more users to be programming directly against the SchemaRDD API rather than the core RDD API. SchemaRDD, through its less commonly used DSL originally designed for writing test cases, has always had a data-frame-like API. In 1.3, we are redesigning the API to make it usable for end users.
> >>>>>>>>>>
> >>>>>>>>>> There are two motivations for the renaming:
> >>>>>>>>>>
> >>>>>>>>>> 1. DataFrame seems to be a more self-evident name than SchemaRDD.
> >>>>>>>>>>
> >>>>>>>>>> 2. SchemaRDD/DataFrame is actually not going to be an RDD anymore (even though it would contain some RDD functions like map, flatMap, etc), and calling it Schema*RDD* while it is not an RDD is highly confusing. Instead, DataFrame.rdd will return the underlying RDD for all RDD methods.
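> >>>>>>>>>>
> >>>>>>>>>> For example, something along these lines (a hypothetical sketch; the exact 1.3 method names may differ, and "events.parquet" is a made-up path):
> >>>>>>>>>>
> >>>>>>>>>>   val df: DataFrame = sqlContext.parquetFile("events.parquet")
> >>>>>>>>>>   val rows: RDD[Row] = df.rdd                                        // DataFrame -> RDD
> >>>>>>>>>>   val df2: DataFrame = sqlContext.createDataFrame(rows, df.schema)   // RDD -> DataFrame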
> >>>>>>>>>>
> >>>>>>>>>> My understanding is that very few users program directly against the SchemaRDD API at the moment, because it is not well documented. However, to maintain backward compatibility, we can create a type alias for DataFrame that is still named SchemaRDD. This will maintain source compatibility for Scala. That said, we will have to update all existing materials to use DataFrame rather than SchemaRDD.