Can't SchemaRDD remain as it is, but deprecated, and be removed in release 1.5 (+/- 1), for example, with the new code added to DataFrame? That way, we don't impact existing code for the next few releases.
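
To make the deprecation idea concrete, here is a minimal sketch in plain Python. It is only an illustration under stated assumptions: both classes are hypothetical stand-ins and do not reflect Spark's actual API, and the `select` method exists only to show that old call sites keep working. The old name becomes a thin shim over the new one and warns whenever it is used.

```python
import warnings


class DataFrame:
    """Hypothetical stand-in for the new API (not Spark's real class)."""

    def __init__(self, rows, schema):
        self.rows = list(rows)      # list of tuples, one per record
        self.schema = list(schema)  # column names, in tuple order

    def select(self, *names):
        # Project each row onto the requested columns.
        idx = [self.schema.index(n) for n in names]
        return DataFrame([tuple(r[i] for i in idx) for r in self.rows], names)


class SchemaRDD(DataFrame):
    """Old name kept as a thin shim: same behavior, plus a warning."""

    def __init__(self, rows, schema):
        warnings.warn(
            "SchemaRDD is deprecated and may be removed in a later release; "
            "use DataFrame instead",
            DeprecationWarning,
            stacklevel=2,
        )
        super().__init__(rows, schema)


# Existing code keeps working, but now emits a DeprecationWarning.
old = SchemaRDD([(1, "a"), (2, "b")], ["id", "name"])
names = old.select("name")
```

With something along these lines, all new functionality would go into DataFrame only, and the shim could be dropped once users have had a few releases to migrate.
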
2015-01-27 0:02 GMT-02:00 Kushal Datta <kushal.da...@gmail.com>:

> I want to address the issue that Matei raised about the heavy lifting
> required for full SQL support. It is amazing that even after 30 years of
> research there is not a single good open source columnar database like
> Vertica. There is a column store option in MySQL, but it is not nearly as
> sophisticated as Vertica or MonetDB. But there's a true need for such a
> system. I wonder why that is, and it's high time to change that.
>
> On Jan 26, 2015 5:47 PM, "Sandy Ryza" <sandy.r...@cloudera.com> wrote:
>
> > Both SchemaRDD and DataFrame sound fine to me, though I like the former
> > slightly better because it's more descriptive.
> >
> > Even if SchemaRDD needs to rely on Spark SQL under the covers, it would
> > be clearer from a user-facing perspective to at least choose a package
> > name for it that omits "sql".
> >
> > I would also be in favor of adding a separate Spark Schema module for
> > Spark SQL to rely on, but I imagine that might be too large a change at
> > this point?
> >
> > -Sandy
> >
> > On Mon, Jan 26, 2015 at 5:32 PM, Matei Zaharia <matei.zaha...@gmail.com>
> > wrote:
> >
> > > (Actually when we designed Spark SQL we thought of giving it another
> > > name, like Spark Schema, but we decided to stick with SQL since that
> > > was the most obvious use case to many users.)
> > >
> > > Matei
> > >
> > > > On Jan 26, 2015, at 5:31 PM, Matei Zaharia <matei.zaha...@gmail.com>
> > > > wrote:
> > > >
> > > > While it might be possible to move this concept to Spark Core
> > > > long-term, supporting structured data efficiently does require quite
> > > > a bit of the infrastructure in Spark SQL, such as query planning and
> > > > columnar storage. The intent of Spark SQL though is to be more than a
> > > > SQL server -- it's meant to be a library for manipulating structured
> > > > data.
> > > > Since this is possible to build over the core API, it's pretty
> > > > natural to organize it that way, same as Spark Streaming is a
> > > > library.
> > > >
> > > > Matei
> > > >
> > > >> On Jan 26, 2015, at 4:26 PM, Koert Kuipers <ko...@tresata.com> wrote:
> > > >>
> > > >> "The context is that SchemaRDD is becoming a common data format used
> > > >> for bringing data into Spark from external systems, and used for
> > > >> various components of Spark, e.g. MLlib's new pipeline API."
> > > >>
> > > >> i agree. this to me also implies it belongs in spark core, not sql
> > > >>
> > > >> On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak <
> > > >> michaelma...@yahoo.com.invalid> wrote:
> > > >>
> > > >>> And on the off chance that anyone hasn't seen it yet, the Jan. 13
> > > >>> Bay Area Spark Meetup YouTube video contained a wealth of background
> > > >>> information on this idea (mostly from Patrick and Reynold :-).
> > > >>>
> > > >>> https://www.youtube.com/watch?v=YWppYPWznSQ
> > > >>>
> > > >>> ________________________________
> > > >>> From: Patrick Wendell <pwend...@gmail.com>
> > > >>> To: Reynold Xin <r...@databricks.com>
> > > >>> Cc: "dev@spark.apache.org" <dev@spark.apache.org>
> > > >>> Sent: Monday, January 26, 2015 4:01 PM
> > > >>> Subject: Re: renaming SchemaRDD -> DataFrame
> > > >>>
> > > >>>
> > > >>> One thing potentially not clear from this e-mail, there will be a
> > > >>> 1:1 correspondence where you can get an RDD to/from a DataFrame.
> > > >>>
> > > >>>
> > > >>> On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin <r...@databricks.com>
> > > >>> wrote:
> > > >>>> Hi,
> > > >>>>
> > > >>>> We are considering renaming SchemaRDD -> DataFrame in 1.3, and
> > > >>>> wanted to get the community's opinion.
> > > >>>>
> > > >>>> The context is that SchemaRDD is becoming a common data format used
> > > >>>> for bringing data into Spark from external systems, and used for
> > > >>>> various components of Spark, e.g.
> > > >>>> MLlib's new pipeline API. We also expect more and more users to be
> > > >>>> programming directly against the SchemaRDD API rather than the core
> > > >>>> RDD API. SchemaRDD, through its less commonly used DSL originally
> > > >>>> designed for writing test cases, has always had a data-frame-like
> > > >>>> API. In 1.3, we are redesigning the API to make it usable for end
> > > >>>> users.
> > > >>>>
> > > >>>>
> > > >>>> There are two motivations for the renaming:
> > > >>>>
> > > >>>> 1. DataFrame seems to be a more self-evident name than SchemaRDD.
> > > >>>>
> > > >>>> 2. SchemaRDD/DataFrame is actually not going to be an RDD anymore
> > > >>>> (even though it would contain some RDD functions like map, flatMap,
> > > >>>> etc.), and calling it Schema*RDD* while it is not an RDD is highly
> > > >>>> confusing. Instead, DataFrame.rdd will return the underlying RDD
> > > >>>> for all RDD methods.
> > > >>>>
> > > >>>>
> > > >>>> My understanding is that very few users program directly against
> > > >>>> the SchemaRDD API at the moment, because it is not well documented.
> > > >>>> However, to maintain backward compatibility, we can create a type
> > > >>>> alias for DataFrame that is still named SchemaRDD. This will
> > > >>>> maintain source compatibility for Scala. That said, we will have to
> > > >>>> update all existing materials to use DataFrame rather than
> > > >>>> SchemaRDD.
> > > >>>
> > > >>> ---------------------------------------------------------------------
> > > >>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> > > >>> For additional commands, e-mail: dev-h...@spark.apache.org