That's actually not Row vs non-Row.

It's just primitive vs non-primitive. Primitives get automatically
flattened, to avoid having to type ._1 all the time.
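For concreteness, a rough sketch of that flattening rule (the `pairs` example and
the expected schemas are assumptions for illustration, not verified output):

    import sqlContext.implicits._   // implicit encoders; spark-shell imports this automatically

    // Primitive element type: mapped to a single flattened column, conventionally named "value".
    val ints = sqlContext.createDataset(Seq(1, 2, 3))              // Dataset[Int] -> schema [value: int]

    // Tuple element type: one column per field, which is what forces ._1 / ._2 in typed code.
    val pairs = sqlContext.createDataset(Seq((1, "a"), (2, "b")))  // Dataset[(Int, String)] -> schema [_1: int, _2: string]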

On Fri, Feb 26, 2016 at 2:06 AM, Sun, Rui <rui....@intel.com> wrote:

> Thanks for the explanation.
>
> What confuses me is the different internal semantics of Dataset for non-Row
> types (primitive types, for example) versus the Row type:
>
> Dataset[Int] is internally actually Dataset[Row(value:Int)]
>
> scala> val ds = sqlContext.createDataset(Seq(1,2,3))
>
> ds: org.apache.spark.sql.Dataset[Int] = [value: int]
>
> scala> ds.schema.json
>
> res17: String =
> {"type":"struct","fields":[{"name":"value","type":"integer","nullable":false,"metadata":{}}]}
>
> But obviously Dataset[Row] is not internally Dataset[Row(value: Row)].
>
> *From:* Reynold Xin [mailto:r...@databricks.com]
> *Sent:* Friday, February 26, 2016 3:55 PM
> *To:* Sun, Rui <rui....@intel.com>
> *Cc:* Koert Kuipers <ko...@tresata.com>; dev@spark.apache.org
>
> *Subject:* Re: [discuss] DataFrame vs Dataset in Spark 2.0
>
> join and joinWith are just two different join semantics; the difference is
> not about Dataset vs DataFrame.
>
> join is the relational join, where fields are flattened; joinWith is more
> like a tuple join, where the output has two fields that are nested.
>
> So you can do
>
> Dataset[A] joinWith Dataset[B] = Dataset[(A, B)]
>
>
> DataFrame[A] joinWith DataFrame[B] = Dataset[(Row, Row)]
>
> Dataset[A] join Dataset[B] = Dataset[Row]
>
> DataFrame[A] join DataFrame[B] = Dataset[Row]
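A rough 2.0-style sketch of the Dataset side of those equations (the case classes,
the join key, and the SparkSession value `spark` are made up for the example):

    import org.apache.spark.sql.{Dataset, Row}
    import spark.implicits._   // assumes a SparkSession value named `spark`

    case class A(id: Long, x: String)
    case class B(id: Long, y: String)

    val dsA: Dataset[A] = Seq(A(1L, "x1")).toDS()
    val dsB: Dataset[B] = Seq(B(1L, "y1")).toDS()

    // joinWith: tuple-style join, each side survives whole as a nested field
    val nested: Dataset[(A, B)] = dsA.joinWith(dsB, dsA("id") === dsB("id"))

    // join: relational join, fields flattened into one untyped row
    val flat: Dataset[Row] = dsA.join(dsB, "id")   // a DataFrame with columns id, x, y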
>
> On Thu, Feb 25, 2016 at 11:37 PM, Sun, Rui <rui....@intel.com> wrote:
>
> Vote for option 2.
>
> Source compatibility and binary compatibility are very important from the
> user's perspective.
>
> It's unfair to Java developers that they don't have the DataFrame
> abstraction. As you said, sometimes it is more natural to think in terms of
> DataFrame.
>
> I am wondering whether, conceptually, there is a subtle difference between
> DataFrame and Dataset[Row]. For example,
>
> Dataset[T] joinWith Dataset[U]  produces Dataset[(T, U)]
>
> So,
>
> Dataset[Row] joinWith Dataset[Row]  produces Dataset[(Row, Row)]
>
> While
>
> DataFrame join DataFrame is still a DataFrame of Row?
>
> *From:* Reynold Xin [mailto:r...@databricks.com]
> *Sent:* Friday, February 26, 2016 8:52 AM
> *To:* Koert Kuipers <ko...@tresata.com>
> *Cc:* dev@spark.apache.org
> *Subject:* Re: [discuss] DataFrame vs Dataset in Spark 2.0
>
> Yes - and that's why source compatibility is broken.
>
> Note that it is not just a "convenience" thing. Conceptually DataFrame is
> a Dataset[Row], and for some developers it is more natural to think about
> "DataFrame" rather than "Dataset[Row]".
>
> If we were in C++, DataFrame would've been a type alias for Dataset[Row]
> too, and some methods would return DataFrame (e.g. sql method).
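In Scala the equivalent of that alias looks roughly like this (a hypothetical
sketch against a 2.0-style API, not the actual Spark source):

    import org.apache.spark.sql.{Dataset, Row, SQLContext}

    object AliasSketch {
      // DataFrame becomes only another name for Dataset[Row]; the alias has no
      // bytecode of its own, so Java callers would see Dataset<Row> instead.
      type DataFrame = Dataset[Row]

      // Methods can still advertise the friendlier name in their signatures,
      // e.g. something shaped like the sql method mentioned above:
      def sql(ctx: SQLContext, query: String): DataFrame = ctx.sql(query)
    }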
>
> On Thu, Feb 25, 2016 at 4:50 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
> Since a type alias is purely a convenience for the Scala compiler, does
> Option 1 mean that the concept of DataFrame ceases to exist from a Java
> perspective, and that Java users will have to refer to Dataset<Row>?
>
> On Thu, Feb 25, 2016 at 6:23 PM, Reynold Xin <r...@databricks.com> wrote:
>
> When we first introduced Dataset in 1.6 as an experimental API, we wanted to
> merge Dataset/DataFrame but couldn't, because we didn't want to break the
> pre-existing DataFrame API (e.g. the map function would have to return a
> Dataset rather than an RDD). In Spark 2.0, one of the main API changes is to
> merge DataFrame and Dataset.
>
> Conceptually, DataFrame is just a Dataset[Row]. In practice, there are two
> ways to implement this:
>
> Option 1. Make DataFrame a type alias for Dataset[Row]
>
> Option 2. Make DataFrame a concrete class that extends Dataset[Row]
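The shape of the two choices, as a toy sketch with stand-in types (nothing from
Spark here; it only shows where the type mismatch under Option 2 comes from):

    // Stand-ins for the real classes, just to keep the example self-contained.
    class Row
    class Dataset[T]

    object TwoOptions {
      // Option 1: a pure alias; Dataset[Row] and DataFrame are the same type.
      type DataFrameAlias = Dataset[Row]

      // Option 2: a distinct concrete subclass.
      class DataFrame extends Dataset[Row]

      def wantsAlias(df: DataFrameAlias): Unit = ()
      def wantsClass(df: DataFrame): Unit = ()

      val ds: Dataset[Row] = new DataFrame
      wantsAlias(ds)      // fine under Option 1
      // wantsClass(ds)   // does not compile under Option 2: a Dataset[Row] is not
                          // a DataFrame, which is the confusion noted in the cons below.
    }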
>
> I'm wondering what you think about this. The pros and cons I can think of
> are:
>
> Option 1. Make DataFrame a type alias for Dataset[Row]
>
> + Cleaner conceptually, especially in Scala. It will be very clear what
> libraries or applications need to do, and we won't see type mismatches
> (e.g. a function expects a DataFrame, but the user is passing in a Dataset[Row]).
>
> + A lot less code
>
> - Breaks source compatibility for the DataFrame API in Java, and binary
> compatibility for Scala/Java
>
> Option 2. Make DataFrame a concrete class that extends Dataset[Row]
>
> The pros/cons are basically the inverse of Option 1.
>
> + In most cases, can maintain source compatibility for the DataFrame API
> in Java, and binary compatibility for Scala/Java
>
> - A lot more code (1000+ loc)
>
> - Less clean, and it can be confusing when users pass a Dataset[Row] into a
> function that expects a DataFrame
>
> The concerns are mostly with Scala/Java. For Python, it is very easy to
> maintain source compatibility for both (there is no concept of binary
> compatibility), and for R, we are only supporting the DataFrame operations
> anyway, because that's a more familiar interface for R users outside of Spark.