Option should play nicely with encoders, but it's always possible there are bugs. I think those function signatures are slightly more expensive (one extra object allocation), and they're not as Java-friendly, so we probably don't want them to be the default.
That said, I would like to enable that kind of sugar while still taking advantage of all the optimizations going on under the covers. Can you get it to work if you use `as[...]` instead of `map`?

On Wed, Jun 1, 2016 at 11:59 AM, Richard Marscher <rmarsc...@localytics.com> wrote:

> Ah, thanks, I missed seeing the PR for
> https://issues.apache.org/jira/browse/SPARK-15441. If the rows become
> null objects, then I can implement methods that map those back to
> results aligning more closely with the RDD interface.
>
> As a follow-on, I'm curious about thoughts on enriching the Dataset join
> interface versus a package, or users sugaring it for themselves. I haven't
> considered the implications of what the Dataset optimizations, Tungsten,
> and/or bytecode generation can do now regarding joins, so I may be missing
> a critical benefit there around, say, avoiding Options in favor of nulls.
> If nothing else, Option doesn't have a first-class Encoder or DataType
> yet, and maybe for good reasons.
>
> I did find the RDD join interface elegant, though. In an ideal world, an
> API comparable to the following would be nice:
> https://gist.github.com/rmarsch/3ea78b3a9a8a0e83ce162ed947fcab06
>
> On Wed, Jun 1, 2016 at 1:42 PM, Michael Armbrust <mich...@databricks.com>
> wrote:
>
>> Thanks for the feedback. I think this will address at least some of the
>> problems you are describing: https://github.com/apache/spark/pull/13425
>>
>> On Wed, Jun 1, 2016 at 9:58 AM, Richard Marscher <
>> rmarsc...@localytics.com> wrote:
>>
>>> Hi,
>>>
>>> I've been working on transitioning from RDD to Datasets in our
>>> codebase in anticipation of being able to leverage features of 2.0.
>>>
>>> I'm having a lot of difficulty with the impedance mismatches between
>>> how outer joins worked with RDD versus Dataset. The Dataset joins feel
>>> like a big step backwards, IMO. With RDD, leftOuterJoin would give you
>>> Option types of the results from the right side of the join.
>>> This follows idiomatic Scala, avoiding nulls, and was easy to work
>>> with.
>>>
>>> Now with Dataset there is only joinWith, where you specify the join
>>> type, but it has lost all the semantics of identifying missing data
>>> from outer joins. I could write some enriched methods on Dataset with
>>> an implicit class to abstract the messiness away if Dataset nulled out
>>> all mismatched data from an outer join; however, the problem goes even
>>> further in that the values aren't always null. Integer, for example,
>>> defaults to -1 instead of null. Now it's completely ambiguous which
>>> data in the join was actually there versus populated via this atypical
>>> semantic.
>>>
>>> Are there additional options available to work around this issue? I
>>> can convert to RDD and back to Dataset, but that's less than ideal.
>>>
>>> Thanks,
>>> --
>>> *Richard Marscher*
>>> Senior Software Engineer
>>> Localytics
>>> Localytics.com <http://localytics.com/> | Our Blog
>>> <http://localytics.com/blog> | Twitter <http://twitter.com/localytics>
>>> | Facebook <http://facebook.com/localytics> | LinkedIn
>>> <http://www.linkedin.com/company/1148792?trk=tyah>
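[Editor's note: a minimal sketch of the enrichment discussed in this thread, assuming Spark 2.x. The method name `leftOuterJoinWith` and object `DatasetJoinSyntax` are hypothetical, not part of Spark's API; it relies on SPARK-15441's behavior of returning null for an unmatched right side, and on an implicit Encoder being available for the tuple-with-Option result.]

```scala
import org.apache.spark.sql.{Column, Dataset, Encoder}

object DatasetJoinSyntax {
  // Hypothetical enrichment: an RDD-style leftOuterJoin on Dataset that
  // lifts the right side into an Option so callers never see nulls.
  implicit class RichDataset[L](val left: Dataset[L]) extends AnyVal {
    def leftOuterJoinWith[R](right: Dataset[R], condition: Column)(
        implicit e: Encoder[(L, Option[R])]): Dataset[(L, Option[R])] =
      left
        .joinWith(right, condition, "left_outer")
        // With SPARK-15441, an unmatched right side comes back as null,
        // so Option(_) restores the RDD leftOuterJoin semantics.
        .map { case (l, r) => (l, Option(r)) }
  }
}
```

Per Michael's suggestion, the `map` (and its per-row object allocation) might be avoidable by casting the joined result directly with `as[(L, Option[R])]`, letting the encoder handle the null-to-None conversion, though whether that works depends on Option encoder support in the version at hand.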