It might make sense, but this option seems to carry all the cons of Option 2, and yet doesn't provide compatibility for Java?
On Thu, Feb 25, 2016 at 3:31 PM, Michael Malak <michaelma...@yahoo.com> wrote:

> Would it make sense (in terms of feasibility, code organization, and
> politics) to have a JavaDataFrame, as a way to isolate the 1000+ extra
> lines to a Java compatibility layer/class?
>
> ------------------------------
> *From:* Reynold Xin <r...@databricks.com>
> *To:* "dev@spark.apache.org" <dev@spark.apache.org>
> *Sent:* Thursday, February 25, 2016 4:23 PM
> *Subject:* [discuss] DataFrame vs Dataset in Spark 2.0
>
> When we first introduced Dataset in 1.6 as an experimental API, we wanted
> to merge Dataset/DataFrame but couldn't, because we didn't want to break
> the pre-existing DataFrame API (e.g. the map function should return a
> Dataset rather than an RDD). In Spark 2.0, one of the main API changes is
> to merge DataFrame and Dataset.
>
> Conceptually, DataFrame is just a Dataset[Row]. In practice, there are two
> ways to implement this:
>
> Option 1. Make DataFrame a type alias for Dataset[Row]
>
> Option 2. DataFrame as a concrete class that extends Dataset[Row]
>
> I'm wondering what you think about this. The pros and cons I can think of
> are:
>
> Option 1. Make DataFrame a type alias for Dataset[Row]
>
> + Cleaner conceptually, especially in Scala. It will be very clear what
> libraries or applications need to do, and we won't see type mismatches
> (e.g. a function expects a DataFrame, but the user is passing in a
> Dataset[Row]).
> + A lot less code
> - Breaks source compatibility for the DataFrame API in Java, and binary
> compatibility for Scala/Java
>
> Option 2. DataFrame as a concrete class that extends Dataset[Row]
>
> The pros/cons are basically the inverse of Option 1.
>
> + In most cases, can maintain source compatibility for the DataFrame API
> in Java, and binary compatibility for Scala/Java
> - A lot more code (1000+ loc)
> - Less clean, and it can be confusing when users pass a Dataset[Row] into
> a function that expects a DataFrame
>
> The concerns are mostly with Scala/Java. For Python, it is very easy to
> maintain source compatibility for both (there is no concept of binary
> compatibility), and for R we are only supporting the DataFrame operations
> anyway, because that is a more familiar interface for R users outside of
> Spark.
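For concreteness, here is a minimal, self-contained Scala sketch contrasting the two options. The toy Dataset and Row classes, and the names Option1, Option2, and fromRows, are simplified stand-ins for illustration only; they are not the real Spark classes or the proposed implementation.

// Simplified sketch: a toy Dataset[T] and Row stand in for the real classes.
object DataFrameSketch {

  class Row(val values: Seq[Any])

  class Dataset[T](val data: Seq[T]) {
    def filter(p: T => Boolean): Dataset[T] = new Dataset(data.filter(p))
  }

  // Option 1: DataFrame is merely a type alias. Zero extra code, but any
  // signature that used to say "DataFrame" now erases to Dataset[Row],
  // which is what breaks Java source / Scala-Java binary compatibility.
  object Option1 {
    type DataFrame = Dataset[Row]

    def fromRows(rows: Seq[Row]): DataFrame = new Dataset(rows)
  }

  // Option 2: DataFrame is a concrete subclass. Each transformation has to
  // be overridden to return DataFrame instead of Dataset[Row] (the source of
  // the "1000+ loc"), but existing Java signatures keep compiling.
  object Option2 {
    class DataFrame(rows: Seq[Row]) extends Dataset[Row](rows) {
      override def filter(p: Row => Boolean): DataFrame =
        new DataFrame(data.filter(p))
    }
  }

  def main(args: Array[String]): Unit = {
    val rows = Seq(new Row(Seq(1)), new Row(Seq(2)))

    // Option 1: a "DataFrame" and a Dataset[Row] are the same type.
    val df1: Option1.DataFrame = Option1.fromRows(rows)

    // Option 2: DataFrame is a distinct type; a plain Dataset[Row] would not
    // be accepted where a DataFrame is required without a conversion.
    val df2 = new Option2.DataFrame(rows)
    val filtered: Option2.DataFrame = df2.filter(_ => true)

    println(s"${df1.data.size} rows (Option 1), ${filtered.data.size} rows (Option 2)")
  }
}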