I vote for Option 1: 1) since 2.0 is a major API revision, some API changes are expected; 2) it helps long-term code base maintenance, at the cost of short-term pain on the Java side; 3) it is not clear how large the code base using the Java DataFrame APIs actually is.
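Concretely, the two shapes under discussion look roughly like the sketch below. This is a minimal illustration only: Dataset and Row here are simplified stand-ins for the real Spark classes, and the single filter method stands in for the full Dataset API.

    // Simplified stand-ins for the real Spark classes, for illustration only.
    class Row
    class Dataset[T] {
      def filter(f: T => Boolean): Dataset[T] = this
    }

    // Option 1: DataFrame as a type alias (in Spark this would live in a
    // package object). Dataset[Row] and DataFrame are the same type to the
    // compiler, so no wrapper code is needed, but Java never sees a
    // DataFrame class at all, since type aliases exist only in Scala.
    object Option1 {
      type DataFrame = Dataset[Row]
      def useIt(df: DataFrame): DataFrame = df.filter(_ => true)
    }

    // Option 2: DataFrame as a concrete subclass. Java keeps a real
    // DataFrame class, but every inherited method that returns Dataset[Row]
    // has to be overridden with a covariant DataFrame return type to
    // preserve the old signatures, which is where the 1000+ loc comes from.
    object Option2 {
      class DataFrame extends Dataset[Row] {
        override def filter(f: Row => Boolean): DataFrame = this
      }
    }

Under Option 1, the type-mismatch confusion mentioned below cannot arise, since a Dataset[Row] is a DataFrame by definition.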
On Thu, Feb 25, 2016 at 3:23 PM, Reynold Xin <r...@databricks.com> wrote:

> When we first introduced Dataset in 1.6 as an experimental API, we wanted
> to merge Dataset/DataFrame but couldn't, because we didn't want to break
> the pre-existing DataFrame API (e.g. the map function should return a
> Dataset rather than an RDD). In Spark 2.0, one of the main API changes is
> to merge DataFrame and Dataset.
>
> Conceptually, a DataFrame is just a Dataset[Row]. In practice, there are
> two ways to implement this:
>
> Option 1. Make DataFrame a type alias for Dataset[Row]
>
> Option 2. Make DataFrame a concrete class that extends Dataset[Row]
>
> I'm wondering what you think about this. The pros and cons I can think of
> are:
>
> Option 1. Make DataFrame a type alias for Dataset[Row]
>
> + Cleaner conceptually, especially in Scala. It will be very clear what
> libraries or applications need to do, and we won't see type mismatches
> (e.g. a function expects a DataFrame, but the user is passing in a
> Dataset[Row]).
> + A lot less code.
> - Breaks source compatibility for the DataFrame API in Java, and binary
> compatibility for Scala/Java.
>
> Option 2. Make DataFrame a concrete class that extends Dataset[Row]
>
> The pros/cons are basically the inverse of Option 1.
>
> + In most cases, can maintain source compatibility for the DataFrame API
> in Java, and binary compatibility for Scala/Java.
> - A lot more code (1000+ loc).
> - Less clean, and can be confusing when users pass a Dataset[Row] into a
> function that expects a DataFrame.
>
> The concerns are mostly with Scala/Java. For Python, it is very easy to
> maintain source compatibility for both (there is no concept of binary
> compatibility), and for R we are only supporting the DataFrame operations
> anyway, because that is a more familiar interface for R users outside of
> Spark.