It's a good point. I will update the documentation to say that this is not meant to be subclassed externally.
On Tue, Feb 10, 2015 at 12:10 PM, Koert Kuipers <ko...@tresata.com> wrote:

thanks matei, it's good to know i can create them like that

reynold, yeah somehow the word sql gets me going :) sorry... yeah, agreed that you need new transformations to preserve the schema info. i misunderstood and thought i had to implement the whole bunch, but that is clearly not necessary, as matei indicated.

alright, i am clearly being slow/dense here, but now it makes sense to me...

On Tue, Feb 10, 2015 at 2:58 PM, Reynold Xin <r...@databricks.com> wrote:

Koert,

Don't get too hung up on the name SQL. This is exactly what you want: a collection of record-like objects with field names and runtime types.

Almost all of the 40 methods are transformations for structured data, such as aggregation on a field, or filtering on a field. If all you have is the old RDD-style map/flatMap, then any transformation would lose the schema information, making the extra schema information useless.

On Tue, Feb 10, 2015 at 11:47 AM, Koert Kuipers <ko...@tresata.com> wrote:

so i understand the success of spark.sql. besides the fact that anything with the word SQL in its name will have thousands of developers running towards it because of the familiarity, there is also a genuine need for a generic RDD that holds record-like objects, with field names and runtime types. after all, that is a successful generic abstraction used in many structured data tools.

but to me that abstraction is as simple as:

  trait SchemaRDD extends RDD[Row] {
    def schema: StructType
  }

and perhaps another abstraction to indicate it intends to be column oriented (with a few methods to efficiently extract a subset of columns). so that could be DataFrame.

such simple contracts would allow many people to write loaders for this (say from csv) and whatnot.

what i do not understand is why it has to be much more complex than this. but if i look at DataFrame it has so much additional stuff that has (in my eyes) nothing to do with generic structured data analysis.

for example, to implement DataFrame i need to implement about 40 additional methods!? and for some of them the SQL-ness is obviously leaking into the abstraction. for example, why would i care about:

  def registerTempTable(tableName: String): Unit

best, koert
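(For illustration, a minimal sketch of the kind of loader such a contract would enable. The CsvLoader name and the naive parsing are hypothetical, not an existing Spark API; it simply produces the two pieces the proposed trait exposes, an RDD[Row] plus a StructType.)

  import org.apache.spark.SparkContext
  import org.apache.spark.rdd.RDD
  import org.apache.spark.sql.Row
  import org.apache.spark.sql.types.{StringType, StructField, StructType}

  // Hypothetical loader: every column is read as a nullable string.
  object CsvLoader {
    def load(sc: SparkContext, path: String, fieldNames: Seq[String]): (RDD[Row], StructType) = {
      val schema = StructType(fieldNames.map(name => StructField(name, StringType, nullable = true)))
      val rows = sc.textFile(path).map(line => Row.fromSeq(line.split(",", -1).toSeq))
      (rows, schema)
    }
  }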
On Sun, Feb 1, 2015 at 3:31 AM, Evan Chan <velvia.git...@gmail.com> wrote:

It is true that you can persist SchemaRDDs / DataFrames to disk via Parquet, but a lot of time is lost and inefficiencies creep in. The in-memory columnar cached representation is completely different from the Parquet file format, and I believe there has to be a translation into a Row (because ultimately Spark SQL traverses Rows -- even the InMemoryColumnarTableScan has to convert the columns into Rows for row-based processing). On the other hand, traditional data frames process in a columnar fashion. Columnar storage is good, but nowhere near as good as columnar processing.

Another issue, which I don't know if it is solved yet, is that it is difficult for Tachyon to efficiently cache Parquet files without understanding the file format itself.

I gave a talk at last year's Spark Summit on this topic.

I'm working on efforts to change this, however. Shoot me an email at velvia at gmail if you're interested in joining forces.

On Thu, Jan 29, 2015 at 1:59 PM, Cheng Lian <lian.cs....@gmail.com> wrote:

Yes, when a DataFrame is cached in memory, it's stored in an efficient columnar format. And you can also easily persist it on disk using Parquet, which is also columnar.

Cheng

On 1/29/15 1:24 PM, Koert Kuipers wrote:

to me the word DataFrame does come with certain expectations. one of them is that the data is stored columnar. in R, data.frame internally uses a list of sequences, i think, but since lists can have labels it's more like a SortedMap[String, Array[_]]. this makes certain operations very cheap (such as adding a column).

in Spark the closest thing would be a data structure where, per Partition, the data is also stored columnar. does spark SQL already use something like that? Evan mentioned "Spark SQL columnar compression", which sounds like it. where can i find that?

thanks

On Thu, Jan 29, 2015 at 2:32 PM, Evan Chan <velvia.git...@gmail.com> wrote:

+1.... having proper NA support is much cleaner than using null, at least the Java null.

On Wed, Jan 28, 2015 at 6:10 PM, Evan R. Sparks <evan.spa...@gmail.com> wrote:

You've got to be a little bit careful here. "NA" in systems like R or pandas may have special meaning that is distinct from "null".

See, e.g. http://www.r-bloggers.com/r-na-vs-null/

On Wed, Jan 28, 2015 at 4:42 PM, Reynold Xin <r...@databricks.com> wrote:

Isn't that just "null" in SQL?

On Wed, Jan 28, 2015 at 4:41 PM, Evan Chan <velvia.git...@gmail.com> wrote:

I believe that most DataFrame implementations out there, like Pandas, support the idea of missing values / NA, and some support the idea of Not Meaningful as well.

Does Row support anything like that? That is important for certain applications. I thought that Row worked by being a mutable object, but haven't looked into the details in a while.

-Evan

On Wed, Jan 28, 2015 at 4:23 PM, Reynold Xin <r...@databricks.com> wrote:

It shouldn't change the data source api at all, because data sources create RDD[Row], and that gets converted into a DataFrame automatically (previously into a SchemaRDD).

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala

One thing that will break the data source API in 1.3 is the location of types. Types were previously defined in sql.catalyst.types, and are now moved to sql.types. After 1.3, sql.catalyst is hidden from users, and all public APIs have first-class classes/objects defined in sql directly.
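(To make this concrete, here is a rough sketch of a minimal data source against that interface, assuming the 1.3 layout described above: the relation only reports a schema and returns an RDD[Row], and the types come from org.apache.spark.sql.types rather than sql.catalyst.types. The RangeRelation example itself is hypothetical; the SQLContext turns the returned RDD[Row] plus schema into a DataFrame automatically, as described above.)

  import org.apache.spark.rdd.RDD
  import org.apache.spark.sql.{Row, SQLContext}
  import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
  import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

  // Hypothetical relation that produces a single integer column 0..n-1.
  class RangeRelation(val sqlContext: SQLContext, n: Int) extends BaseRelation with TableScan {
    override def schema: StructType = StructType(Seq(StructField("i", IntegerType, nullable = false)))
    override def buildScan(): RDD[Row] = sqlContext.sparkContext.parallelize(0 until n).map(i => Row(i))
  }

  class DefaultSource extends RelationProvider {
    override def createRelation(sqlContext: SQLContext, parameters: Map[String, String]): BaseRelation =
      new RangeRelation(sqlContext, parameters.getOrElse("n", "10").toInt)
  }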
On Wed, Jan 28, 2015 at 4:20 PM, Evan Chan <velvia.git...@gmail.com> wrote:

Hey guys,

How does this impact the data sources API? I was planning on using this for a project.

+1 that many things from spark-sql / DataFrame are universally desirable and useful.

By the way, one thing that prevents the columnar compression stuff in Spark SQL from being more useful is, at least from previous talks with Reynold and Michael et al., that the format was not designed for persistence.

I have a new project that aims to change that. It is a zero-serialisation, high-performance binary vector library, designed from the outset to be persistent-storage friendly. Maybe one day it can replace the Spark SQL columnar compression.

Michael told me this would be a lot of work, and recreates parts of Parquet, but I think it's worth it. LMK if you'd like more details.

-Evan

On Tue, Jan 27, 2015 at 4:35 PM, Reynold Xin <r...@databricks.com> wrote:

Alright, I have merged the patch (https://github.com/apache/spark/pull/4173) since I don't see any strong opinions against it (as a matter of fact most were for it). We can still change it if somebody lays out a strong argument.

On Tue, Jan 27, 2015 at 12:25 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:

The type alias means your methods can specify either type and they will work. It's just another name for the same type. But Scaladocs and such will show DataFrame as the type.

Matei
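(A small sketch of what that means in practice, assuming the 1.3 arrangement described above, where org.apache.spark.sql exposes the alias; the parquet path and the rowCount helper are only examples, not existing API.)

  import org.apache.spark.sql.{DataFrame, SQLContext, SchemaRDD}

  // A method written against the old name still compiles and accepts a DataFrame,
  // because the alias is just another name for the same type.
  def rowCount(data: SchemaRDD): Long = data.count()

  def example(sqlContext: SQLContext): Long = {
    val df: DataFrame = sqlContext.parquetFile("events.parquet")  // example path
    rowCount(df)
  }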
On Jan 27, 2015, at 12:10 PM, Dirceu Semighini Filho <dirceu.semigh...@gmail.com> wrote:

Reynold,
But with the type alias we will have the same problem, right?
If the methods don't receive SchemaRDD anymore, we will have to change our code to migrate from SchemaRDD to DataFrame. Unless we have an implicit conversion between DataFrame and SchemaRDD.

2015-01-27 17:18 GMT-02:00 Reynold Xin <r...@databricks.com>:

Dirceu,

That is not possible because one cannot overload return types.

SQLContext.parquetFile (and many other methods) needs to return some type, and that type cannot be both SchemaRDD and DataFrame.

In 1.3, we will create a type alias for DataFrame called SchemaRDD to not break source compatibility for Scala.

On Tue, Jan 27, 2015 at 6:28 AM, Dirceu Semighini Filho <dirceu.semigh...@gmail.com> wrote:

Can't the SchemaRDD remain the same, but deprecated, and be removed in release 1.5 (+/- 1) for example, with the new code added to DataFrame? With this, we don't impact existing code for the next few releases.
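(Sketched out, the kind of transition being described here might look like the following; the version string and message are placeholders, not a committed plan.)

  package org.apache.spark

  package object sql {
    // Keep the old name around, but mark it so callers are nudged to migrate
    // before the alias is eventually removed in a later release.
    @deprecated("Use DataFrame instead", "1.3.0")
    type SchemaRDD = DataFrame
  }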
2015-01-27 0:02 GMT-02:00 Kushal Datta <kushal.da...@gmail.com>:

I want to address the issue that Matei raised about the heavy lifting required for full SQL support. It is amazing that even after 30 years of research there is not a single good open source columnar database like Vertica. There is a column store option in MySQL, but it is not nearly as sophisticated as Vertica or MonetDB. But there's a true need for such a system. I wonder why that is, and it's high time to change it.

On Jan 26, 2015 5:47 PM, "Sandy Ryza" <sandy.r...@cloudera.com> wrote:

Both SchemaRDD and DataFrame sound fine to me, though I like the former slightly better because it's more descriptive.

Even if SchemaRDD needs to rely on Spark SQL under the covers, it would be clearer from a user-facing perspective to at least choose a package name for it that omits "sql".

I would also be in favor of adding a separate Spark Schema module for Spark SQL to rely on, but I imagine that might be too large a change at this point?

-Sandy

On Mon, Jan 26, 2015 at 5:32 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:

(Actually when we designed Spark SQL we thought of giving it another name, like Spark Schema, but we decided to stick with SQL since that was the most obvious use case to many users.)

Matei

On Jan 26, 2015, at 5:31 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:

While it might be possible to move this concept to Spark Core long-term, supporting structured data efficiently does require quite a bit of the infrastructure in Spark SQL, such as query planning and columnar storage.
The intent of Spark SQL, though, is to be more than a SQL server -- it's meant to be a library for manipulating structured data. Since this is possible to build over the core API, it's pretty natural to organize it that way, the same as Spark Streaming is a library.

Matei

On Jan 26, 2015, at 4:26 PM, Koert Kuipers <ko...@tresata.com> wrote:

"The context is that SchemaRDD is becoming a common data format used for bringing data into Spark from external systems, and used for various components of Spark, e.g. MLlib's new pipeline API."

i agree. this to me also implies it belongs in spark core, not sql

On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak <michaelma...@yahoo.com.invalid> wrote:

And in the off chance that anyone hasn't seen it yet, the Jan. 13 Bay Area Spark Meetup video on YouTube contained a wealth of background information on this idea (mostly from Patrick and Reynold :-).
https://www.youtube.com/watch?v=YWppYPWznSQ

________________________________
From: Patrick Wendell <pwend...@gmail.com>
To: Reynold Xin <r...@databricks.com>
Cc: "dev@spark.apache.org" <dev@spark.apache.org>
Sent: Monday, January 26, 2015 4:01 PM
Subject: Re: renaming SchemaRDD -> DataFrame

One thing potentially not clear from this e-mail: there will be a 1:1 correspondence where you can get an RDD to/from a DataFrame.

On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin <r...@databricks.com> wrote:

Hi,

We are considering renaming SchemaRDD -> DataFrame in 1.3, and wanted to get the community's opinion.

The context is that SchemaRDD is becoming a common data format used for bringing data into Spark from external systems, and used for various components of Spark, e.g. MLlib's new pipeline API. We also expect more and more users to be programming directly against the SchemaRDD API rather than the core RDD API. SchemaRDD, through its less commonly used DSL originally designed for writing test cases, has always had the data-frame-like API.
In 1.3, we are redesigning the API to make it usable for end users.

There are two motivations for the renaming:

1. DataFrame seems to be a more self-evident name than SchemaRDD.

2. SchemaRDD/DataFrame is actually not going to be an RDD anymore (even though it would contain some RDD functions like map, flatMap, etc), and calling it Schema*RDD* while it is not an RDD is highly confusing. Instead, DataFrame.rdd will return the underlying RDD for all RDD methods.

My understanding is that very few users program directly against the SchemaRDD API at the moment, because it is not well documented. However, to maintain backward compatibility, we can create a type alias for DataFrame that is still named SchemaRDD. This will maintain source compatibility for Scala. That said, we will have to update all existing materials to use DataFrame rather than SchemaRDD.
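(A brief sketch of what that correspondence looks like, assuming the 1.3 API with createDataFrame as the RDD-to-DataFrame direction; the column names and the filter are only examples.)

  import org.apache.spark.rdd.RDD
  import org.apache.spark.sql.{DataFrame, Row, SQLContext}
  import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

  def roundTrip(sqlContext: SQLContext, raw: RDD[(Int, String)]): RDD[Row] = {
    val schema = StructType(Seq(
      StructField("id", IntegerType, nullable = false),
      StructField("name", StringType, nullable = true)))

    // RDD -> DataFrame: pair an RDD of Rows with an explicit schema.
    val df: DataFrame = sqlContext.createDataFrame(raw.map { case (id, name) => Row(id, name) }, schema)

    // DataFrame -> RDD: .rdd exposes the underlying RDD[Row] for plain RDD methods.
    df.rdd.filter(row => !row.isNullAt(1))
  }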