Thanks for the feedback. I think this will address at least some of the problems you are describing: https://github.com/apache/spark/pull/13425
On Wed, Jun 1, 2016 at 9:58 AM, Richard Marscher <rmarsc...@localytics.com> wrote: > Hi, > > I've been working on transitioning from RDD to Datasets in our codebase in > anticipation of being able to leverage features of 2.0. > > I'm having a lot of difficulties with the impedance mismatches between how > outer joins worked with RDD versus Dataset. The Dataset joins feel like a > big step backwards IMO. With RDD, leftOuterJoin would give you Option types > of the results from the right side of the join. This follows idiomatic > Scala avoiding nulls and was easy to work with. > > Now with Dataset there is only joinWith where you specify the join type, > but it lost all the semantics of identifying missing data from outer joins. > I can write some enriched methods on Dataset with an implicit class to > abstract messiness away if Dataset nulled out all mismatching data from an > outer join, however the problem goes even further in that the values aren't > always null. Integer, for example, defaults to -1 instead of null. Now it's > completely ambiguous what data in the join was actually there versus > populated via this atypical semantic. > > Are there additional options available to work around this issue? I can > convert to RDD and back to Dataset but that's less than ideal. > > Thanks, > -- > *Richard Marscher* > Senior Software Engineer > Localytics > Localytics.com <http://localytics.com/> | Our Blog > <http://localytics.com/blog> | Twitter <http://twitter.com/localytics> | > Facebook <http://facebook.com/localytics> | LinkedIn > <http://www.linkedin.com/company/1148792?trk=tyah> >