Alright, I have merged the patch ( https://github.com/apache/spark/pull/4173 ), since I don't see any strong opinions against it (as a matter of fact, most were for it). We can still change it if somebody lays out a strong argument.
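For anyone catching up on the thread below: the compatibility shim is essentially a deprecated type alias, so existing Scala code that names SchemaRDD keeps compiling, and no implicit conversion is needed since the alias *is* the same type. A rough sketch of the idea (not the exact code in the patch; assume DataFrame lives in org.apache.spark.sql):

    package org.apache.spark

    package object sql {
      // SchemaRDD is just another name for the same type, so methods that
      // declare SchemaRDD and methods that declare DataFrame are
      // interchangeable at compile time -- no implicit conversion needed.
      @deprecated("use DataFrame", "1.3.0")
      type SchemaRDD = DataFrame
    }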
On Tue, Jan 27, 2015 at 12:25 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:

> The type alias means your methods can specify either type and they will work. It's just another name for the same type. But Scaladocs and such will show DataFrame as the type.
>
> Matei
>
> > On Jan 27, 2015, at 12:10 PM, Dirceu Semighini Filho <dirceu.semigh...@gmail.com> wrote:
> >
> > Reynold,
> > But with the type alias we will have the same problem, right?
> > If the methods don't accept SchemaRDD anymore, we will have to change our code to migrate from SchemaRDD to DataFrame, unless we have an implicit conversion between DataFrame and SchemaRDD.
> >
> > 2015-01-27 17:18 GMT-02:00 Reynold Xin <r...@databricks.com>:
> >
> >> Dirceu,
> >>
> >> That is not possible because one cannot overload return types.
> >>
> >> SQLContext.parquetFile (and many other methods) needs to return some type, and that type cannot be both SchemaRDD and DataFrame.
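> >>
> >> For instance, here is a sketch (hypothetical code, not the actual parquetFile implementation) of why such an overload cannot exist when SchemaRDD and DataFrame are two distinct types:
> >>
> >>   class SQLContext {
> >>     def parquetFile(path: String): SchemaRDD = ???
> >>     // won't compile: an overload may not differ only in its return type
> >>     def parquetFile(path: String): DataFrame = ???
> >>   }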
> >>
> >> In 1.3, we will create a type alias for DataFrame called SchemaRDD to not break source compatibility for Scala.
> >>
> >> On Tue, Jan 27, 2015 at 6:28 AM, Dirceu Semighini Filho <dirceu.semigh...@gmail.com> wrote:
> >>
> >>> Can't SchemaRDD remain the same, but deprecated, and be removed in release 1.5 (+/- 1), for example, with the new code added to DataFrame? That way we wouldn't impact existing code for the next few releases.
> >>>
> >>> 2015-01-27 0:02 GMT-02:00 Kushal Datta <kushal.da...@gmail.com>:
> >>>
> >>>> I want to address the issue that Matei raised about the heavy lifting required for full SQL support. It is amazing that even after 30 years of research there is not a single good open source columnar database like Vertica. There is a column store option in MySQL, but it is not nearly as sophisticated as Vertica or MonetDB. But there is a true need for such a system. I wonder why that is, and it's high time to change it.
> >>>>
> >>>> On Jan 26, 2015 5:47 PM, "Sandy Ryza" <sandy.r...@cloudera.com> wrote:
> >>>>
> >>>>> Both SchemaRDD and DataFrame sound fine to me, though I like the former slightly better because it's more descriptive.
> >>>>>
> >>>>> Even if SchemaRDD needs to rely on Spark SQL under the covers, it would be clearer from a user-facing perspective to at least choose a package name for it that omits "sql".
> >>>>>
> >>>>> I would also be in favor of adding a separate Spark Schema module for Spark SQL to rely on, but I imagine that might be too large a change at this point?
> >>>>>
> >>>>> -Sandy
> >>>>>
> >>>>> On Mon, Jan 26, 2015 at 5:32 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
> >>>>>
> >>>>>> (Actually, when we designed Spark SQL we thought of giving it another name, like Spark Schema, but we decided to stick with SQL since that was the most obvious use case to many users.)
> >>>>>>
> >>>>>> Matei
> >>>>>>
> >>>>>>> On Jan 26, 2015, at 5:31 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
> >>>>>>>
> >>>>>>> While it might be possible to move this concept to Spark Core long-term, supporting structured data efficiently does require quite a bit of the infrastructure in Spark SQL, such as query planning and columnar storage. The intent of Spark SQL, though, is to be more than a SQL server -- it's meant to be a library for manipulating structured data. Since this is possible to build over the core API, it's pretty natural to organize it that way, just as Spark Streaming is a library.
> >>>>>>>
> >>>>>>> Matei
> >>>>>>>
> >>>>>>>> On Jan 26, 2015, at 4:26 PM, Koert Kuipers <ko...@tresata.com> wrote:
> >>>>>>>>
> >>>>>>>> "The context is that SchemaRDD is becoming a common data format used for bringing data into Spark from external systems, and used for various components of Spark, e.g. MLlib's new pipeline API."
> >>>>>>>>
> >>>>>>>> I agree. To me this also implies it belongs in Spark core, not SQL.
> >>>>>>>>
> >>>>>>>> On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak <michaelma...@yahoo.com.invalid> wrote:
> >>>>>>>>
> >>>>>>>>> And on the off chance that anyone hasn't seen it yet, the Jan. 13 Bay Area Spark Meetup YouTube video contained a wealth of background information on this idea (mostly from Patrick and Reynold :-).
> >>>>>>>>>
> >>>>>>>>> https://www.youtube.com/watch?v=YWppYPWznSQ
> >>>>>>>>>
> >>>>>>>>> ________________________________
> >>>>>>>>> From: Patrick Wendell <pwend...@gmail.com>
> >>>>>>>>> To: Reynold Xin <r...@databricks.com>
> >>>>>>>>> Cc: "dev@spark.apache.org" <dev@spark.apache.org>
> >>>>>>>>> Sent: Monday, January 26, 2015 4:01 PM
> >>>>>>>>> Subject: Re: renaming SchemaRDD -> DataFrame
> >>>>>>>>>
> >>>>>>>>> One thing potentially not clear from this e-mail: there will be a 1:1 correspondence where you can get an RDD to/from a DataFrame.
> >>>>>>>>>
> >>>>>>>>> On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin <r...@databricks.com> wrote:
> >>>>>>>>>> Hi,
> >>>>>>>>>>
> >>>>>>>>>> We are considering renaming SchemaRDD -> DataFrame in 1.3, and wanted to get the community's opinion.
> >>>>>>>>>>
> >>>>>>>>>> The context is that SchemaRDD is becoming a common data format used for bringing data into Spark from external systems, and used for various components of Spark, e.g. MLlib's new pipeline API. We also expect more and more users to be programming directly against the SchemaRDD API rather than the core RDD API. SchemaRDD, through its less commonly used DSL originally designed for writing test cases, has always had a data-frame-like API. In 1.3, we are redesigning the API to make it usable for end users.
> >>>>>>>>>>
> >>>>>>>>>> There are two motivations for the renaming:
> >>>>>>>>>>
> >>>>>>>>>> 1. DataFrame seems to be a more self-evident name than SchemaRDD.
> >>>>>>>>>>
> >>>>>>>>>> 2. SchemaRDD/DataFrame is actually not going to be an RDD anymore (even though it would contain some RDD functions like map, flatMap, etc), and calling it Schema*RDD* while it is not an RDD is highly confusing. Instead, DataFrame.rdd will return the underlying RDD for all RDD methods.
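> >>>>>>>>>>
> >>>>>>>>>> For example, something along these lines (a hypothetical sketch; the exact 1.3 method names may differ, and "events.parquet" is a made-up path):
> >>>>>>>>>>
> >>>>>>>>>>   val df: DataFrame = sqlContext.parquetFile("events.parquet")
> >>>>>>>>>>   val rows: RDD[Row] = df.rdd                                        // DataFrame -> RDD
> >>>>>>>>>>   val df2: DataFrame = sqlContext.createDataFrame(rows, df.schema)   // RDD -> DataFrame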
> >>>>>>>>>>
> >>>>>>>>>> My understanding is that very few users program directly against the SchemaRDD API at the moment, because it is not well documented. However, to maintain backward compatibility, we can create a type alias for DataFrame that is still named SchemaRDD. This will maintain source compatibility for Scala. That said, we will have to update all existing materials to use DataFrame rather than SchemaRDD.