join and joinWith are just two different join semantics; this is not about Dataset vs DataFrame.
join is the relational join, where the fields of both inputs are flattened; joinWith is more like a tuple join, where the output has two nested fields. So you can do:

Dataset[A] joinWith Dataset[B] = Dataset[(A, B)]
DataFrame joinWith DataFrame = Dataset[(Row, Row)]
Dataset[A] join Dataset[B] = Dataset[Row]
DataFrame join DataFrame = Dataset[Row]

(A short illustrative sketch of both join flavors appears below the quoted thread.)

On Thu, Feb 25, 2016 at 11:37 PM, Sun, Rui <rui....@intel.com> wrote:

> Vote for option 2.
>
> Source compatibility and binary compatibility are very important from the
> user's perspective.
>
> It's unfair for Java developers that they don't have the DataFrame
> abstraction. As you said, sometimes it is more natural to think about
> DataFrame.
>
> I am wondering if, conceptually, there is a slight, subtle difference
> between DataFrame and Dataset[Row]? For example,
>
> Dataset[T] joinWith Dataset[U] produces Dataset[(T, U)]
>
> So,
>
> Dataset[Row] joinWith Dataset[Row] produces Dataset[(Row, Row)]
>
> While
>
> DataFrame join DataFrame is still a DataFrame of Row?
>
> From: Reynold Xin [mailto:r...@databricks.com]
> Sent: Friday, February 26, 2016 8:52 AM
> To: Koert Kuipers <ko...@tresata.com>
> Cc: dev@spark.apache.org
> Subject: Re: [discuss] DataFrame vs Dataset in Spark 2.0
>
> Yes - and that's why source compatibility is broken.
>
> Note that it is not just a "convenience" thing. Conceptually DataFrame is
> a Dataset[Row], and for some developers it is more natural to think about
> "DataFrame" rather than "Dataset[Row]".
>
> If we were in C++, DataFrame would've been a type alias for Dataset[Row]
> too, and some methods would return DataFrame (e.g. the sql method).
>
> On Thu, Feb 25, 2016 at 4:50 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
> Since a type alias is purely a convenience thing for the Scala compiler,
> does option 1 mean that the concept of DataFrame ceases to exist from a
> Java perspective, and they will have to refer to Dataset<Row>?
>
> On Thu, Feb 25, 2016 at 6:23 PM, Reynold Xin <r...@databricks.com> wrote:
>
> When we first introduced Dataset in 1.6 as an experimental API, we wanted
> to merge Dataset/DataFrame but couldn't, because we didn't want to break
> the pre-existing DataFrame API (e.g. the map function should return a
> Dataset rather than an RDD). In Spark 2.0, one of the main API changes is
> to merge DataFrame and Dataset.
>
> Conceptually, DataFrame is just a Dataset[Row]. In practice, there are two
> ways to implement this:
>
> Option 1. Make DataFrame a type alias for Dataset[Row]
>
> Option 2. Make DataFrame a concrete class that extends Dataset[Row]
>
> I'm wondering what you think about this. The pros and cons I can think of
> are:
>
> Option 1. Make DataFrame a type alias for Dataset[Row]
>
> + Cleaner conceptually, especially in Scala. It will be very clear what
> libraries or applications need to do, and we won't see type mismatches
> (e.g. a function expects a DataFrame, but the user is passing in a
> Dataset[Row]).
> + A lot less code.
> - Breaks source compatibility for the DataFrame API in Java, and binary
> compatibility for Scala/Java.
>
> Option 2. Make DataFrame a concrete class that extends Dataset[Row]
>
> The pros/cons are basically the inverse of Option 1.
>
> + In most cases, can maintain source compatibility for the DataFrame API
> in Java, and binary compatibility for Scala/Java.
> - A lot more code (1000+ lines of code).
> - Less clean, and it can be confusing when users pass a Dataset[Row] into
> a function that expects a DataFrame.
>
> The concerns are mostly with Scala/Java. For Python, it is very easy to
> maintain source compatibility for both (there is no concept of binary
> compatibility), and for R, we are only supporting the DataFrame operations
> anyway, because that is a more familiar interface for R users outside of
> Spark.
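A minimal Scala sketch of the two join flavors described at the top of this message, assuming Spark 2.x (where DataFrame is Dataset[Row]); the case classes Person and Dept, the datasets people and depts, and the object name are illustrative placeholders rather than anything from the thread.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative types for the sketch (not from the thread).
case class Person(name: String, deptId: Int)
case class Dept(id: Int, deptName: String)

object JoinVsJoinWithSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("join-vs-joinWith")
      .getOrCreate()
    import spark.implicits._

    val people = Seq(Person("alice", 1), Person("bob", 2)).toDS()
    val depts  = Seq(Dept(1, "eng"), Dept(2, "sales")).toDS()

    // joinWith: tuple join -- each output row keeps the two sides nested.
    // Static type: Dataset[(Person, Dept)]
    val nested = people.joinWith(depts, people("deptId") === depts("id"))

    // join: relational join -- the columns of both sides are flattened.
    // Static type: DataFrame, i.e. Dataset[Row]
    val flat = people.join(depts, people("deptId") === depts("id"))

    nested.show()  // columns _1 and _2, each holding a whole Person / Dept struct
    flat.show()    // columns name, deptId, id, deptName
    spark.stop()
  }
}
```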
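And a toy, self-contained sketch of the type mechanics behind the two options, using stand-in classes rather than Spark's real ones; Row, Dataset, DataFrameAsClass, and describe below are placeholders, not Spark APIs. (For context, Spark 2.0 ultimately took Option 1, defining type DataFrame = Dataset[Row] in the org.apache.spark.sql package object.)

```scala
// Toy stand-ins to show the type mechanics only; these are NOT the real Spark types.
object AliasVsSubclassSketch {
  class Row
  class Dataset[T]

  // Option 1: DataFrame is just another name for Dataset[Row].
  // The compiler sees a single type, so no wrappers or conversions are needed.
  type DataFrame = Dataset[Row]

  // Option 2 (sketched): DataFrame is a separate class extending Dataset[Row].
  class DataFrameAsClass extends Dataset[Row]

  // A function written against the alias...
  def describe(df: DataFrame): String = "got a DataFrame (= Dataset[Row])"

  def main(args: Array[String]): Unit = {
    val ds: Dataset[Row] = new Dataset[Row]

    // ...happily accepts a Dataset[Row]: with the alias there is no mismatch.
    println(describe(ds))

    // Under Option 2 the subtyping is one-way: every DataFrameAsClass is a
    // Dataset[Row], but a plain Dataset[Row] is not a DataFrameAsClass,
    // which is the confusion mentioned in the thread.
    // val df: DataFrameAsClass = ds   // would not compile
  }
}
```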