Yes, when a DataFrame is cached in memory, it's stored in an efficient columnar format. And you can also easily persist it on disk using Parquet, which is also columnar.
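
A minimal sketch of both points, assuming a sqlContext as in the shell and hypothetical paths (1.2/1.3-era method names):

  val df = sqlContext.parquetFile("hdfs:///data/events.parquet")  // columnar on disk (Parquet)
  df.cache()                          // cached in the in-memory columnar store
  df.saveAsParquetFile("hdfs:///data/events_copy.parquet")        // persisted back out as Parquet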

Cheng

On 1/29/15 1:24 PM, Koert Kuipers wrote:
To me the word DataFrame does come with certain expectations. One of them is that the data is stored columnar. In R, data.frame internally uses a list of sequences, I think, but since lists can have labels it's more like a SortedMap[String, Array[_]]. This makes certain operations very cheap (such as adding a column).
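
A toy sketch (not Spark code) of that shape, where adding a column never touches the existing data:

  import scala.collection.immutable.SortedMap

  // each label maps to a whole column; rows are never materialized
  case class ColumnarPartition(columns: SortedMap[String, Array[Any]]) {
    def withColumn(name: String, values: Array[Any]): ColumnarPartition =
      copy(columns = columns + (name -> values))   // cheap: only the map changes
  }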

In Spark the closest thing would be a data structure where, per partition, the data is also stored columnar. Does Spark SQL already use something like that? Evan mentioned "Spark SQL columnar compression", which sounds like it. Where can I find that?

thanks

On Thu, Jan 29, 2015 at 2:32 PM, Evan Chan <velvia.git...@gmail.com> wrote:

+1... having proper NA support is much cleaner than using null, at least the Java null.

On Wed, Jan 28, 2015 at 6:10 PM, Evan R. Sparks <evan.spa...@gmail.com> wrote:
You've got to be a little bit careful here. "NA" in systems like R or pandas may have special meaning that is distinct from "null".

See, e.g. http://www.r-bloggers.com/r-na-vs-null/
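
For reference, a minimal sketch of how a missing value surfaces in Spark SQL today (1.3-style Row API): it is a plain JVM null, with no separate NA or "not meaningful" marker of the kind the R post describes:

  import org.apache.spark.sql.Row

  val person = Row("alice", null)      // age unknown, represented as null
  val ageMissing = person.isNullAt(1)  // true; null is the only notion of "missing" available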



On Wed, Jan 28, 2015 at 4:42 PM, Reynold Xin <r...@databricks.com> wrote:
Isn't that just "null" in SQL?

On Wed, Jan 28, 2015 at 4:41 PM, Evan Chan <velvia.git...@gmail.com> wrote:

I believe that most DataFrame implementations out there, like pandas, support the idea of missing values / NA, and some support the idea of Not Meaningful as well.

Does Row support anything like that? That is important for certain applications. I thought that Row worked by being a mutable object, but I haven't looked into the details in a while.

-Evan

On Wed, Jan 28, 2015 at 4:23 PM, Reynold Xin <r...@databricks.com> wrote:
It shouldn't change the data source API at all, because data sources create RDD[Row], and that gets converted into a DataFrame automatically (previously to SchemaRDD):

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala
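
For concreteness, a rough sketch of a minimal relation against that interface (1.3-shaped names; treat the details as approximate):

  import org.apache.spark.rdd.RDD
  import org.apache.spark.sql.{Row, SQLContext}
  import org.apache.spark.sql.sources.{BaseRelation, TableScan}
  import org.apache.spark.sql.types.{StringType, StructField, StructType}

  // the source only hands back a schema plus an RDD[Row];
  // Spark SQL wraps that into a DataFrame (formerly SchemaRDD) itself
  class SingleColumnRelation(@transient val sqlContext: SQLContext)
      extends BaseRelation with TableScan {
    override def schema: StructType = StructType(Seq(StructField("value", StringType)))
    override def buildScan(): RDD[Row] =
      sqlContext.sparkContext.parallelize(Seq(Row("a"), Row("b")))
  }
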
One thing that will break the data source API in 1.3 is the location of types. Types were previously defined in sql.catalyst.types and have now moved to sql.types. After 1.3, sql.catalyst is hidden from users, and all public APIs have first-class classes/objects defined in sql directly.
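
If I read the change right, the user-visible difference is just the import path, e.g.:

  // Spark 1.2 and earlier
  import org.apache.spark.sql.catalyst.types.{StringType, StructField, StructType}
  // Spark 1.3 and later
  import org.apache.spark.sql.types.{StringType, StructField, StructType}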



On Wed, Jan 28, 2015 at 4:20 PM, Evan Chan <velvia.git...@gmail.com> wrote:
Hey guys,

How does this impact the data sources API? I was planning on using this for a project.

+1 that many things from spark-sql / DataFrame are universally desirable and useful.

By the way, one thing that prevents the columnar compression stuff in Spark SQL from being more useful is, at least from previous talks with Reynold and Michael et al., that the format was not designed for persistence.

I have a new project that aims to change that. It is a zero-serialisation, high-performance binary vector library, designed from the outset to be persistent-storage friendly. Maybe one day it can replace the Spark SQL columnar compression.

Michael told me this would be a lot of work, and recreates parts of Parquet, but I think it's worth it. LMK if you'd like more details.
-Evan

On Tue, Jan 27, 2015 at 4:35 PM, Reynold Xin <r...@databricks.com> wrote:
Alright, I have merged the patch (https://github.com/apache/spark/pull/4173) since I don't see any strong opinions against it (as a matter of fact, most were for it). We can still change it if somebody lays out a strong argument.

On Tue, Jan 27, 2015 at 12:25 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:

The type alias means your methods can specify either type and they will work. It's just another name for the same type. But Scaladocs and such will show DataFrame as the type.
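
A toy demonstration of what the alias buys (not Spark's actual classes):

  object AliasDemo extends App {
    class DataFrame(val schemaString: String)
    type SchemaRDD = DataFrame                     // just another name for the same type

    def describe(data: SchemaRDD): Unit = println(data.schemaString)

    describe(new DataFrame("id:int,name:string"))  // compiles: the two names are interchangeable
  }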

Matei

On Jan 27, 2015, at 12:10 PM, Dirceu Semighini Filho <dirceu.semigh...@gmail.com> wrote:
Reynold,
But with the type alias we will have the same problem, right? If the methods don't receive SchemaRDD anymore, we will have to change our code to migrate from SchemaRDD to DataFrame, unless we have an implicit conversion between DataFrame and SchemaRDD.



2015-01-27 17:18 GMT-02:00 Reynold Xin <r...@databricks.com>:

Dirceu,

That is not possible because one cannot overload return types. SQLContext.parquetFile (and many other methods) needs to return some type, and that type cannot be both SchemaRDD and DataFrame.

In 1.3, we will create a type alias for DataFrame called SchemaRDD to not break source compatibility for Scala.
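
An illustration of the constraint (hypothetical classes, not Spark code):

  class SchemaRDD
  class DataFrame

  object SQLContextSketch {
    def parquetFile(path: String): DataFrame = new DataFrame
    // def parquetFile(path: String): SchemaRDD = new SchemaRDD
    //   ^ would not compile: Scala cannot overload a method on return type alone,
    //     so one name has to pick one return type -- hence the alias instead
  }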


On Tue, Jan 27, 2015 at 6:28 AM, Dirceu Semighini Filho <dirceu.semigh...@gmail.com> wrote:

Can't SchemaRDD remain the same, but deprecated, and be removed in release 1.5 (+/- 1), for example, with the new code added to DataFrame? With this, we don't impact existing code for the next few releases.
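
A sketch of that route (hypothetical, heavily simplified):

  // keep the old name around but flag it, so existing code compiles with a warning
  @deprecated("Use DataFrame instead; scheduled for removal around 1.5", "1.3.0")
  class SchemaRDD

  // new development happens on the new name
  class DataFrame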



2015-01-27 0:02 GMT-02:00 Kushal Datta <kushal.da...@gmail.com>:

I want to address the issue that Matei raised about the heavy lifting required for full SQL support. It is amazing that even after 30 years of research there is not a single good open source columnar database like Vertica. There is a column store option in MySQL, but it is not nearly as sophisticated as Vertica or MonetDB. But there's a true need for such a system. I wonder why that is, and it's high time to change it.
On Jan 26, 2015 5:47 PM, "Sandy Ryza" <sandy.r...@cloudera.com> wrote:
Both SchemaRDD and DataFrame sound fine to me, though I like the former slightly better because it's more descriptive.

Even if SchemaRDD needs to rely on Spark SQL under the covers, it would be clearer from a user-facing perspective to at least choose a package name for it that omits "sql".

I would also be in favor of adding a separate Spark Schema module for Spark SQL to rely on, but I imagine that might be too large a change at this point?

-Sandy

On Mon, Jan 26, 2015 at 5:32 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:

(Actually when we designed Spark SQL we thought of giving it another name, like Spark Schema, but we decided to stick with SQL since that was the most obvious use case to many users.)

Matei

On Jan 26, 2015, at 5:31 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
While it might be possible to move this concept to Spark Core long-term, supporting structured data efficiently does require quite a bit of the infrastructure in Spark SQL, such as query planning and columnar storage. The intent of Spark SQL, though, is to be more than a SQL server -- it's meant to be a library for manipulating structured data. Since this is possible to build over the core API, it's pretty natural to organize it that way, same as Spark Streaming is a library.
Matei

On Jan 26, 2015, at 4:26 PM, Koert Kuipers <ko...@tresata.com> wrote:
"The context is that SchemaRDD is becoming a common
data
format
used
for
bringing data into Spark from external systems, and
used
for
various
components of Spark, e.g. MLlib's new pipeline API."

i agree. this to me also implies it belongs in spark
core,
not
sql
On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak <michaelma...@yahoo.com.invalid> wrote:

And on the off chance that anyone hasn't seen it yet, the Jan. 13 Bay Area Spark Meetup YouTube video contained a wealth of background information on this idea (mostly from Patrick and Reynold :-).

https://www.youtube.com/watch?v=YWppYPWznSQ

________________________________
From: Patrick Wendell <pwend...@gmail.com>
To: Reynold Xin <r...@databricks.com>
Cc: "dev@spark.apache.org" <dev@spark.apache.org>
Sent: Monday, January 26, 2015 4:01 PM
Subject: Re: renaming SchemaRDD -> DataFrame


One thing potentially not clear from this e-mail: there will be a 1:1 correspondence where you can get an RDD to/from a DataFrame.
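
Roughly, assuming sc and sqlContext as in the shell (1.3-style API):

  import org.apache.spark.sql.Row
  import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

  val rdd = sc.parallelize(Seq(Row(1), Row(2)))
  val schema = StructType(Seq(StructField("id", IntegerType)))
  val df = sqlContext.createDataFrame(rdd, schema)   // RDD -> DataFrame
  val back = df.rdd                                  // DataFrame -> RDD[Row]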

On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin <r...@databricks.com> wrote:
Hi,

We are considering renaming SchemaRDD -> DataFrame in 1.3, and wanted to get the community's opinion.

The context is that SchemaRDD is becoming a common data format used for bringing data into Spark from external systems, and used for various components of Spark, e.g. MLlib's new pipeline API. We also expect more and more users to be programming directly against the SchemaRDD API rather than the core RDD API. SchemaRDD, through its less commonly used DSL originally designed for writing test cases, has always had a data-frame-like API. In 1.3, we are redesigning the API to make it usable for end users.

There are two motivations for the renaming:

1. DataFrame seems to be a more self-evident name than SchemaRDD.
2. SchemaRDD/DataFrame is actually not going to be an RDD anymore (even though it would contain some RDD functions like map, flatMap, etc.), and calling it Schema*RDD* while it is not an RDD is highly confusing. Instead, DataFrame.rdd will return the underlying RDD for all RDD methods.
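
For example, in rough 1.3 terms, assuming a DataFrame df whose first column is a string:

  val lengths = df.map(row => row.getString(0).length)  // a few RDD-like methods stay on DataFrame
  val named = df.rdd.setName("events")                  // the full RDD API via .rdd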

My understanding is that very few users program directly against the SchemaRDD API at the moment, because it is not well documented. However, to maintain backward compatibility, we can create a type alias for DataFrame that is still named SchemaRDD. This will maintain source compatibility for Scala. That said, we will have to update all existing materials to use DataFrame rather than SchemaRDD.

