It might make sense, but this option seems to carry all the cons of Option 2, and yet doesn't provide compatibility for Java?
On Thu, Feb 25, 2016 at 3:31 PM, Michael Malak <michaelma...@yahoo.com> wrote:

> Would it make sense (in terms of feasibility, code organization, and
> politics) to have a JavaDataFrame, as a way to isolate the 1000+ extra
> lines to a Java compatibility layer/class?
>
> ------------------------------
> *From:* Reynold Xin <r...@databricks.com>
> *To:* "dev@spark.apache.org" <dev@spark.apache.org>
> *Sent:* Thursday, February 25, 2016 4:23 PM
> *Subject:* [discuss] DataFrame vs Dataset in Spark 2.0
>
> When we first introduced Dataset in 1.6 as an experimental API, we wanted
> to merge Dataset/DataFrame but couldn't, because we didn't want to break
> the pre-existing DataFrame API (e.g. the map function should return a
> Dataset rather than an RDD). In Spark 2.0, one of the main API changes is
> to merge DataFrame and Dataset.
>
> Conceptually, DataFrame is just a Dataset[Row]. In practice, there are two
> ways to implement this:
>
> Option 1. Make DataFrame a type alias for Dataset[Row]
>
> Option 2. DataFrame as a concrete class that extends Dataset[Row]
>
> I'm wondering what you think about this. The pros and cons I can think of
> are:
>
> Option 1. Make DataFrame a type alias for Dataset[Row]
>
> + Cleaner conceptually, especially in Scala. It will be very clear what
> libraries or applications need to do, and we won't see type mismatches
> (e.g. a function expects a DataFrame, but the user is passing in a
> Dataset[Row]).
> + A lot less code
> - Breaks source compatibility for the DataFrame API in Java, and binary
> compatibility for Scala/Java
>
> Option 2. DataFrame as a concrete class that extends Dataset[Row]
>
> The pros/cons are basically the inverse of Option 1.
>
> + In most cases, can maintain source compatibility for the DataFrame API
> in Java, and binary compatibility for Scala/Java
> - A lot more code (1000+ loc)
> - Less clean, and it can be confusing when users pass a Dataset[Row] into
> a function that expects a DataFrame
>
> The concerns are mostly with Scala/Java. For Python, it is very easy to
> maintain source compatibility for both (there is no concept of binary
> compatibility), and for R we are only supporting the DataFrame operations
> anyway, because that is a more familiar interface for R users outside of
> Spark.
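For concreteness, here is a minimal, self-contained Scala sketch contrasting the two options. The toy Dataset and Row classes, and the names Option1, Option2, and fromRows, are simplified stand-ins for illustration only; they are not the real Spark classes or the proposed implementation.

// Simplified sketch: a toy Dataset[T] and Row stand in for the real classes.
object DataFrameSketch {

  class Row(val values: Seq[Any])

  class Dataset[T](val data: Seq[T]) {
    def filter(p: T => Boolean): Dataset[T] = new Dataset(data.filter(p))
  }

  // Option 1: DataFrame is merely a type alias. Zero extra code, but any
  // signature that used to say "DataFrame" now erases to Dataset[Row],
  // which is what breaks Java source / Scala-Java binary compatibility.
  object Option1 {
    type DataFrame = Dataset[Row]

    def fromRows(rows: Seq[Row]): DataFrame = new Dataset(rows)
  }

  // Option 2: DataFrame is a concrete subclass. Each transformation has to
  // be overridden to return DataFrame instead of Dataset[Row] (the source of
  // the "1000+ loc"), but existing Java signatures keep compiling.
  object Option2 {
    class DataFrame(rows: Seq[Row]) extends Dataset[Row](rows) {
      override def filter(p: Row => Boolean): DataFrame =
        new DataFrame(data.filter(p))
    }
  }

  def main(args: Array[String]): Unit = {
    val rows = Seq(new Row(Seq(1)), new Row(Seq(2)))

    // Option 1: a "DataFrame" and a Dataset[Row] are the same type.
    val df1: Option1.DataFrame = Option1.fromRows(rows)

    // Option 2: DataFrame is a distinct type; a plain Dataset[Row] would not
    // be accepted where a DataFrame is required without a conversion.
    val df2 = new Option2.DataFrame(rows)
    val filtered: Option2.DataFrame = df2.filter(_ => true)

    println(s"${df1.data.size} rows (Option 1), ${filtered.data.size} rows (Option 2)")
  }
}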