Yes - and that's why source compatibility is broken.

Note that it is not just a "convenience" thing. Conceptually a DataFrame is a
Dataset[Row], and for some developers it is more natural to think in terms of
"DataFrame" than "Dataset[Row]".

If we were in C++, DataFrame would've been a type alias for Dataset[Row]
too, and some methods would return DataFrame (e.g. the sql method).
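
For concreteness, here is a minimal sketch of what the Scala alias could
look like (the SqlAlias object and describe helper are just for
illustration, not actual Spark code):

    import org.apache.spark.sql.{Dataset, Row}

    object SqlAlias {
      // DataFrame is just a name; no separate class exists at compile time.
      type DataFrame = Dataset[Row]

      // Any method can advertise DataFrame in its signature, and callers
      // are free to pass a Dataset[Row], since the two are the same type.
      def describe(df: DataFrame): Unit = df.printSchema()
    }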



On Thu, Feb 25, 2016 at 4:50 PM, Koert Kuipers <ko...@tresata.com> wrote:

> Since a type alias is purely a convenience thing for the Scala compiler,
> does Option 1 mean that the concept of DataFrame ceases to exist from a
> Java perspective, and Java users will have to refer to Dataset<Row>?
>
> On Thu, Feb 25, 2016 at 6:23 PM, Reynold Xin <r...@databricks.com> wrote:
>
>> When we first introduced Dataset in 1.6 as an experimental API, we wanted
>> to merge Dataset/DataFrame but couldn't, because we didn't want to break
>> the pre-existing DataFrame API (e.g. the map function would have to return
>> Dataset rather than RDD). In Spark 2.0, one of the main API changes is to
>> merge DataFrame and Dataset.
>>
>> Conceptually, DataFrame is just a Dataset[Row]. In practice, there are
>> two ways to implement this:
>>
>> Option 1. Make DataFrame a type alias for Dataset[Row]
>>
>> Option 2. DataFrame as a concrete class that extends Dataset[Row]
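>>
>> A toy sketch of the structural difference, using a hypothetical Foo in
>> place of Dataset:
>>
>>     class Foo[T]
>>
>>     // Option 1: a pure alias. Bar and Foo[String] are the same type to
>>     // the compiler, interchangeable everywhere.
>>     object Alias { type Bar = Foo[String] }
>>
>>     // Option 2: a real subclass. Every Bar is a Foo[String], but a
>>     // plain Foo[String] is not a Bar.
>>     object Subclass { class Bar extends Foo[String] }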
>>
>>
>> I'm wondering what you think about this. The pros and cons I can think of
>> are:
>>
>>
>> Option 1. Make DataFrame a type alias for Dataset[Row]
>>
>> + Cleaner conceptually, especially in Scala. It will be very clear what
>> libraries or applications need to do, and we won't see type mismatches
>> (e.g. a function expects a DataFrame, but the user is passing in a
>> Dataset[Row])
>> + A lot less code
>> - Breaks source compatibility for the DataFrame API in Java, and binary
>> compatibility for Scala/Java
>>
>>
>> Option 2. DataFrame as a concrete class that extends Dataset[Row]
>>
>> The pros/cons are basically the inverse of Option 1.
>>
>> + In most cases, can maintain source compatibility for the DataFrame API
>> in Java, and binary compatibility for Scala/Java
>> - A lot more code (1000+ loc)
>> - Less clean, and can be confusing when users pass a Dataset[Row] into a
>> function that expects a DataFrame (see the sketch below)
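>>
>> A sketch of that confusion (analyze is a hypothetical library function,
>> not a real API):
>>
>>     import org.apache.spark.sql.{DataFrame, Dataset, Row}
>>
>>     object Demo {
>>       // A hypothetical library function written against the DataFrame API.
>>       def analyze(df: DataFrame): Long = df.count()
>>
>>       // Under Option 1 this compiles, since DataFrame and Dataset[Row]
>>       // are the same type. Under Option 2 it would be a type error,
>>       // since a Dataset[Row] is not a DataFrame.
>>       def fromTypedApi(ds: Dataset[Row]): Long = analyze(ds)
>>     }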
>>
>>
>> The concerns are mostly with Scala/Java. For Python, it is very easy to
>> maintain source compatibility under both options (there is no concept of
>> binary compatibility), and for R we are only supporting the DataFrame
>> operations anyway, because that's a more familiar interface for R users
>> outside of Spark.
>>
>>
>>
>
