RE: [discuss] DataFrame vs Dataset in Spark 2.0

Sun, Rui Fri, 26 Feb 2016 02:07:17 -0800

Thanks for the explaination.

What confusing me is the different internal semantic of Dataset on non-Row type 
(primitive types for example) and Row type:


Dataset[Int] is internally actually Dataset[Row(value:Int)]

scala> val ds = sqlContext.createDataset(Seq(1,2,3))
ds: org.apache.spark.sql.Dataset[Int] = [value: int]

scala> ds.schema.json
res17: String = 
{"type":"struct","fields":[{"name":"value","type":"integer","nullable":false,"metadata":{}}]}

But obviously Dataset[Row] is not internally Dataset[Row(value: Row)].

From: Reynold Xin [mailto:[email protected]]
Sent: Friday, February 26, 2016 3:55 PM
To: Sun, Rui <[email protected]>
Cc: Koert Kuipers <[email protected]>; [email protected]
Subject: Re: [discuss] DataFrame vs Dataset in Spark 2.0

The join and joinWith are just two different join semantics, and is not about 
Dataset vs DataFrame.

join is the relational join, where fields are flattened; joinWith is more like 
a tuple join, where the output has two fields that are nested.

So you can do

Dataset[A] joinWith Dataset[B] = Dataset[(A, B)]

DataFrame[A] joinWith DataFrame[B] = Dataset[(Row, Row)]

Dataset[A] join Dataset[B] = Dataset[Row]

DataFrame[A] join DataFrame[B] = Dataset[Row]



On Thu, Feb 25, 2016 at 11:37 PM, Sun, Rui 
<[email protected]<mailto:[email protected]>> wrote:
Vote for option 2.
Source compatibility and binary compatibility are very important from user’s 
perspective.
It ‘s unfair for Java developers that they don’t have DataFrame abstraction. As 
you said, sometimes it is more natural to think about DataFrame.

I am wondering if conceptually there is slight subtle difference between 
DataFrame and Dataset[Row]? For example,
Dataset[T] joinWith Dataset[U]  produces Dataset[(T, U)]
So,
Dataset[Row] joinWith Dataset[Row]  produces Dataset[(Row, Row)]

While
DataFrame join DataFrame is still DataFrame of Row?

From: Reynold Xin [mailto:[email protected]<mailto:[email protected]>]
Sent: Friday, February 26, 2016 8:52 AM
To: Koert Kuipers <[email protected]<mailto:[email protected]>>
Cc: [email protected]<mailto:[email protected]>
Subject: Re: [discuss] DataFrame vs Dataset in Spark 2.0

Yes - and that's why source compatibility is broken.

Note that it is not just a "convenience" thing. Conceptually DataFrame is a 
Dataset[Row], and for some developers it is more natural to think about 
"DataFrame" rather than "Dataset[Row]".

If we were in C++, DataFrame would've been a type alias for Dataset[Row] too, 
and some methods would return DataFrame (e.g. sql method).



On Thu, Feb 25, 2016 at 4:50 PM, Koert Kuipers 
<[email protected]<mailto:[email protected]>> wrote:
since a type alias is purely a convenience thing for the scala compiler, does 
option 1 mean that the concept of DataFrame ceases to exist from a java 
perspective, and they will have to refer to Dataset<Row>?

On Thu, Feb 25, 2016 at 6:23 PM, Reynold Xin 
<[email protected]<mailto:[email protected]>> wrote:
When we first introduced Dataset in 1.6 as an experimental API, we wanted to 
merge Dataset/DataFrame but couldn't because we didn't want to break the 
pre-existing DataFrame API (e.g. map function should return Dataset, rather 
than RDD). In Spark 2.0, one of the main API changes is to merge DataFrame and 
Dataset.

Conceptually, DataFrame is just a Dataset[Row]. In practice, there are two ways 
to implement this:

Option 1. Make DataFrame a type alias for Dataset[Row]

Option 2. DataFrame as a concrete class that extends Dataset[Row]


I'm wondering what you think about this. The pros and cons I can think of are:


Option 1. Make DataFrame a type alias for Dataset[Row]

+ Cleaner conceptually, especially in Scala. It will be very clear what 
libraries or applications need to do, and we won't see type mismatches (e.g. a 
function expects DataFrame, but user is passing in Dataset[Row]
+ A lot less code
- Breaks source compatibility for the DataFrame API in Java, and binary 
compatibility for Scala/Java


Option 2. DataFrame as a concrete class that extends Dataset[Row]

The pros/cons are basically the inverse of Option 1.

+ In most cases, can maintain source compatibility for the DataFrame API in 
Java, and binary compatibility for Scala/Java
- A lot more code (1000+ loc)
- Less cleaner, and can be confusing when users pass in a Dataset[Row] into a 
function that expects a DataFrame


The concerns are mostly with Scala/Java. For Python, it is very easy to 
maintain source compatibility for both (there is no concept of binary 
compatibility), and for R, we are only supporting the DataFrame operations 
anyway because that's more familiar interface for R users outside of Spark.

RE: [discuss] DataFrame vs Dataset in Spark 2.0

Reply via email to