While it might be possible to move this concept to Spark Core long-term, supporting structured data efficiently does require quite a bit of the infrastructure in Spark SQL, such as query planning and columnar storage. The intent of Spark SQL, though, is to be more than a SQL server -- it's meant to be a library for manipulating structured data. Since this can be built over the core API, it's natural to organize it that way, just as Spark Streaming is a library.

Matei
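To make the "library over the core API" point concrete, here is a minimal Scala sketch, assuming the Spark 1.2-era SQLContext API (the Event class, app name, and table name are illustrative, not from the thread): a plain core-API RDD is handed to Spark SQL, which layers query planning on top of it.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    // An ordinary case class provides the schema.
    case class Event(user: String, bytes: Long)

    object SqlAsLibrary {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("sql-as-library").setMaster("local[*]"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.createSchemaRDD  // implicit RDD[A <: Product] -> SchemaRDD

        // Start from a plain core-API RDD...
        val events = sc.parallelize(Seq(Event("a", 100L), Event("b", 2000L)))

        // ...then register it and query it with Spark SQL.
        events.registerTempTable("events")
        sqlContext.sql("SELECT user, SUM(bytes) FROM events GROUP BY user")
          .collect()
          .foreach(println)

        sc.stop()
      }
    }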
> On Jan 26, 2015, at 4:26 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
> "The context is that SchemaRDD is becoming a common data format used for
> bringing data into Spark from external systems, and used for various
> components of Spark, e.g. MLlib's new pipeline API."
>
> I agree. To me this also implies it belongs in Spark core, not SQL.
>
> On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak <
> michaelma...@yahoo.com.invalid> wrote:
>
>> And on the off chance that anyone hasn't seen it yet, the Jan. 13 Bay
>> Area Spark Meetup video on YouTube contains a wealth of background
>> information on this idea (mostly from Patrick and Reynold :-).
>>
>> https://www.youtube.com/watch?v=YWppYPWznSQ
>>
>> ________________________________
>> From: Patrick Wendell <pwend...@gmail.com>
>> To: Reynold Xin <r...@databricks.com>
>> Cc: "dev@spark.apache.org" <dev@spark.apache.org>
>> Sent: Monday, January 26, 2015 4:01 PM
>> Subject: Re: renaming SchemaRDD -> DataFrame
>>
>> One thing potentially not clear from this e-mail: there will be a 1:1
>> correspondence, so you can get an RDD to/from a DataFrame.
>>
>> On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin <r...@databricks.com> wrote:
>>> Hi,
>>>
>>> We are considering renaming SchemaRDD -> DataFrame in 1.3, and wanted
>>> to get the community's opinion.
>>>
>>> The context is that SchemaRDD is becoming a common data format used for
>>> bringing data into Spark from external systems, and used for various
>>> components of Spark, e.g. MLlib's new pipeline API. We also expect more
>>> and more users to be programming directly against the SchemaRDD API
>>> rather than the core RDD API. SchemaRDD, through its less commonly used
>>> DSL originally designed for writing test cases, has always had a
>>> data-frame-like API. In 1.3, we are redesigning the API to make it
>>> usable for end users.
>>>
>>> There are two motivations for the renaming:
>>>
>>> 1. DataFrame seems to be a more self-evident name than SchemaRDD.
>>>
>>> 2. SchemaRDD/DataFrame is actually not going to be an RDD anymore (even
>>> though it would still offer some RDD functions like map, flatMap, etc.),
>>> and calling it Schema*RDD* while it is not an RDD is highly confusing.
>>> Instead, DataFrame.rdd will return the underlying RDD for all RDD
>>> methods.
>>>
>>> My understanding is that very few users program directly against the
>>> SchemaRDD API at the moment, because it is not well documented. However,
>>> to maintain backward compatibility, we can create a type alias SchemaRDD
>>> that points to DataFrame. This will maintain source compatibility for
>>> Scala. That said, we will have to update all existing materials to use
>>> DataFrame rather than SchemaRDD.
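As a concrete illustration of the points above -- the 1:1 RDD/DataFrame correspondence, the DataFrame.rdd escape hatch, and the backward-compatibility alias -- here is a minimal Scala sketch assuming the Spark 1.3 API shape the thread describes (the Person class and app name are illustrative, not from the thread):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.{DataFrame, SQLContext}

    case class Person(name: String, age: Int)

    object DataFrameRoundTrip {
      // The backward-compatibility alias Reynold describes: old code that
      // says SchemaRDD keeps compiling against the new DataFrame class.
      type SchemaRDD = DataFrame

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("round-trip").setMaster("local[*]"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._

        // RDD -> DataFrame: the 1:1 correspondence Patrick mentions.
        val people = sc.parallelize(Seq(Person("alice", 30), Person("bob", 25)))
        val df: DataFrame = people.toDF()

        // DataFrame -> RDD: .rdd exposes the underlying RDD[Row], so all
        // core RDD methods remain one call away.
        df.rdd.map(row => row.getString(0)).collect().foreach(println)

        sc.stop()
      }
    }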