Yes, when a DataFrame is cached in memory, it's stored in an efficient columnar format. And you can also easily persist it on disk using Parquet, which is also columnar.
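
A minimal sketch of both points, assuming a sqlContext as in the shell and hypothetical paths (1.2/1.3-era method names):

  val df = sqlContext.parquetFile("hdfs:///data/events.parquet")  // columnar on disk (Parquet)
  df.cache()                          // cached in the in-memory columnar store
  df.saveAsParquetFile("hdfs:///data/events_copy.parquet")        // persisted back out as Parquet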

Cheng

On 1/29/15 1:24 PM, Koert Kuipers wrote:
To me the word DataFrame does come with certain expectations. One of them is that the data is stored columnar. In R, data.frame internally uses a list of sequences, I think, but since lists can have labels it's more like a SortedMap[String, Array[_]]. This makes certain operations very cheap (such as adding a column).
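
A toy sketch (not Spark code) of that shape, where adding a column never touches the existing data:

  import scala.collection.immutable.SortedMap

  // each label maps to a whole column; rows are never materialized
  case class ColumnarPartition(columns: SortedMap[String, Array[Any]]) {
    def withColumn(name: String, values: Array[Any]): ColumnarPartition =
      copy(columns = columns + (name -> values))   // cheap: only the map changes
  }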

In Spark the closest thing would be a data structure where, per partition, the data is also stored columnar. Does Spark SQL already use something like that? Evan mentioned "Spark SQL columnar compression", which sounds like it. Where can I find that?

thanks

On Thu, Jan 29, 2015 at 2:32 PM, Evan Chan <velvia.git...@gmail.com> wrote:

+1... having proper NA support is much cleaner than using null, at least the Java null.

On Wed, Jan 28, 2015 at 6:10 PM, Evan R. Sparks <evan.spa...@gmail.com> wrote:
You've got to be a little bit careful here. "NA" in systems like R or pandas may have special meaning that is distinct from "null".

See, e.g. http://www.r-bloggers.com/r-na-vs-null/
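
For reference, a minimal sketch of how a missing value surfaces in Spark SQL today (1.3-style Row API): it is a plain JVM null, with no separate NA or "not meaningful" marker of the kind the R post describes:

  import org.apache.spark.sql.Row

  val person = Row("alice", null)      // age unknown, represented as null
  val ageMissing = person.isNullAt(1)  // true; null is the only notion of "missing" available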



On Wed, Jan 28, 2015 at 4:42 PM, Reynold Xin <r...@databricks.com> wrote:
Isn't that just "null" in SQL?

On Wed, Jan 28, 2015 at 4:41 PM, Evan Chan <velvia.git...@gmail.com> wrote:

I believe that most DataFrame implementations out there, like pandas, support the idea of missing values / NA, and some support the idea of Not Meaningful as well.

Does Row support anything like that? That is important for certain applications. I thought that Row worked by being a mutable object, but I haven't looked into the details in a while.

-Evan

On Wed, Jan 28, 2015 at 4:23 PM, Reynold Xin <r...@databricks.com> wrote:
It shouldn't change the data source API at all, because data sources create RDD[Row], and that gets converted into a DataFrame automatically (previously to SchemaRDD):

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala
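
For concreteness, a rough sketch of a minimal relation against that interface (1.3-shaped names; treat the details as approximate):

  import org.apache.spark.rdd.RDD
  import org.apache.spark.sql.{Row, SQLContext}
  import org.apache.spark.sql.sources.{BaseRelation, TableScan}
  import org.apache.spark.sql.types.{StringType, StructField, StructType}

  // the source only hands back a schema plus an RDD[Row];
  // Spark SQL wraps that into a DataFrame (formerly SchemaRDD) itself
  class SingleColumnRelation(@transient val sqlContext: SQLContext)
      extends BaseRelation with TableScan {
    override def schema: StructType = StructType(Seq(StructField("value", StringType)))
    override def buildScan(): RDD[Row] =
      sqlContext.sparkContext.parallelize(Seq(Row("a"), Row("b")))
  }
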
One thing that will break the data source API in 1.3 is the location of types. Types were previously defined in sql.catalyst.types and have now moved to sql.types. After 1.3, sql.catalyst is hidden from users, and all public APIs have first-class classes/objects defined in sql directly.
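
If I read the change right, the user-visible difference is just the import path, e.g.:

  // Spark 1.2 and earlier
  import org.apache.spark.sql.catalyst.types.{StringType, StructField, StructType}
  // Spark 1.3 and later
  import org.apache.spark.sql.types.{StringType, StructField, StructType}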



On Wed, Jan 28, 2015 at 4:20 PM, Evan Chan <velvia.git...@gmail.com> wrote:
Hey guys,

How does this impact the data sources API? I was planning on using this for a project.

+1 that many things from spark-sql / DataFrame are universally desirable and useful.

By the way, one thing that prevents the columnar compression stuff in Spark SQL from being more useful is, at least from previous talks with Reynold and Michael et al., that the format was not designed for persistence.

I have a new project that aims to change that. It is a zero-serialisation, high-performance binary vector library, designed from the outset to be persistent-storage friendly. Maybe one day it can replace the Spark SQL columnar compression.

Michael told me this would be a lot of work, and recreates parts of Parquet, but I think it's worth it. LMK if you'd like more details.
-Evan

On Tue, Jan 27, 2015 at 4:35 PM, Reynold Xin <r...@databricks.com> wrote:
Alright, I have merged the patch (https://github.com/apache/spark/pull/4173) since I don't see any strong opinions against it (as a matter of fact, most were for it). We can still change it if somebody lays out a strong argument.

On Tue, Jan 27, 2015 at 12:25 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:

The type alias means your methods can specify either type and they will work. It's just another name for the same type. But Scaladocs and such will show DataFrame as the type.
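
A toy demonstration of what the alias buys (not Spark's actual classes):

  object AliasDemo extends App {
    class DataFrame(val schemaString: String)
    type SchemaRDD = DataFrame                     // just another name for the same type

    def describe(data: SchemaRDD): Unit = println(data.schemaString)

    describe(new DataFrame("id:int,name:string"))  // compiles: the two names are interchangeable
  }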

Matei

On Jan 27, 2015, at 12:10 PM, Dirceu Semighini Filho <dirceu.semigh...@gmail.com> wrote:
Reynold,
But with the type alias we will have the same problem, right? If the methods don't receive SchemaRDD anymore, we will have to change our code to migrate from SchemaRDD to DataFrame, unless we have an implicit conversion between DataFrame and SchemaRDD.



2015-01-27 17:18 GMT-02:00 Reynold Xin <r...@databricks.com>:

Dirceu,

That is not possible because one cannot overload return types. SQLContext.parquetFile (and many other methods) needs to return some type, and that type cannot be both SchemaRDD and DataFrame.

In 1.3, we will create a type alias for DataFrame called SchemaRDD to not break source compatibility for Scala.
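
An illustration of the constraint (hypothetical classes, not Spark code):

  class SchemaRDD
  class DataFrame

  object SQLContextSketch {
    def parquetFile(path: String): DataFrame = new DataFrame
    // def parquetFile(path: String): SchemaRDD = new SchemaRDD
    //   ^ would not compile: Scala cannot overload a method on return type alone,
    //     so one name has to pick one return type -- hence the alias instead
  }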


On Tue, Jan 27, 2015 at 6:28 AM, Dirceu Semighini Filho <dirceu.semigh...@gmail.com> wrote:

Can't SchemaRDD remain the same, but deprecated, and be removed in release 1.5 (+/- 1), for example, with the new code added to DataFrame? With this, we don't impact existing code for the next few releases.
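
A sketch of that route (hypothetical, heavily simplified):

  // keep the old name around but flag it, so existing code compiles with a warning
  @deprecated("Use DataFrame instead; scheduled for removal around 1.5", "1.3.0")
  class SchemaRDD

  // new development happens on the new name
  class DataFrame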



2015-01-27 0:02 GMT-02:00 Kushal Datta <kushal.da...@gmail.com>:

I want to address the issue that Matei raised about the heavy lifting required for full SQL support. It is amazing that even after 30 years of research there is not a single good open source columnar database like Vertica. There is a column store option in MySQL, but it is not nearly as sophisticated as Vertica or MonetDB. But there's a true need for such a system. I wonder why that is, and it's high time to change it.
On Jan 26, 2015 5:47 PM, "Sandy Ryza" <sandy.r...@cloudera.com> wrote:
Both SchemaRDD and DataFrame sound fine to me, though I like the former slightly better because it's more descriptive.

Even if SchemaRDD needs to rely on Spark SQL under the covers, it would be clearer from a user-facing perspective to at least choose a package name for it that omits "sql".

I would also be in favor of adding a separate Spark Schema module for Spark SQL to rely on, but I imagine that might be too large a change at this point?

-Sandy

On Mon, Jan 26, 2015 at 5:32 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:

(Actually when we designed Spark SQL we thought of giving it another name, like Spark Schema, but we decided to stick with SQL since that was the most obvious use case to many users.)

Matei

On Jan 26, 2015, at 5:31 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
While it might be possible to move this concept to Spark Core long-term, supporting structured data efficiently does require quite a bit of the infrastructure in Spark SQL, such as query planning and columnar storage. The intent of Spark SQL, though, is to be more than a SQL server -- it's meant to be a library for manipulating structured data. Since this is possible to build over the core API, it's pretty natural to organize it that way, same as Spark Streaming is a library.
Matei

On Jan 26, 2015, at 4:26 PM, Koert Kuipers <ko...@tresata.com> wrote:
"The context is that SchemaRDD is becoming a common
data
format
used
for
bringing data into Spark from external systems, and
used
for
various
components of Spark, e.g. MLlib's new pipeline API."

i agree. this to me also implies it belongs in spark
core,
not
sql
On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak <michaelma...@yahoo.com.invalid> wrote:

And on the off chance that anyone hasn't seen it yet, the Jan. 13 Bay Area Spark Meetup YouTube video contained a wealth of background information on this idea (mostly from Patrick and Reynold :-).

https://www.youtube.com/watch?v=YWppYPWznSQ

________________________________
From: Patrick Wendell <pwend...@gmail.com>
To: Reynold Xin <r...@databricks.com>
Cc: "dev@spark.apache.org" <dev@spark.apache.org>
Sent: Monday, January 26, 2015 4:01 PM
Subject: Re: renaming SchemaRDD -> DataFrame


One thing potentially not clear from this e-mail: there will be a 1:1 correspondence where you can get an RDD to/from a DataFrame.
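
Roughly, assuming sc and sqlContext as in the shell (1.3-style API):

  import org.apache.spark.sql.Row
  import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

  val rdd = sc.parallelize(Seq(Row(1), Row(2)))
  val schema = StructType(Seq(StructField("id", IntegerType)))
  val df = sqlContext.createDataFrame(rdd, schema)   // RDD -> DataFrame
  val back = df.rdd                                  // DataFrame -> RDD[Row]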

On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin <r...@databricks.com> wrote:
Hi,

We are considering renaming SchemaRDD -> DataFrame in 1.3, and wanted to get the community's opinion.

The context is that SchemaRDD is becoming a common data format used for bringing data into Spark from external systems, and used for various components of Spark, e.g. MLlib's new pipeline API. We also expect more and more users to be programming directly against the SchemaRDD API rather than the core RDD API. SchemaRDD, through its less commonly used DSL originally designed for writing test cases, has always had a data-frame-like API. In 1.3, we are redesigning the API to make it usable for end users.

There are two motivations for the renaming:

1. DataFrame seems to be a more self-evident name than SchemaRDD.
2. SchemaRDD/DataFrame is actually not going to be an RDD anymore (even though it would contain some RDD functions like map, flatMap, etc.), and calling it Schema*RDD* while it is not an RDD is highly confusing. Instead, DataFrame.rdd will return the underlying RDD for all RDD methods.
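
For example, in rough 1.3 terms, assuming a DataFrame df whose first column is a string:

  val lengths = df.map(row => row.getString(0).length)  // a few RDD-like methods stay on DataFrame
  val named = df.rdd.setName("events")                  // the full RDD API via .rdd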

My understanding is that very few users program directly against the SchemaRDD API at the moment, because it is not well documented. However, to maintain backward compatibility, we can create a type alias for DataFrame that is still named SchemaRDD. This will maintain source compatibility for Scala. That said, we will have to update all existing materials to use DataFrame rather than SchemaRDD.

