Option should play nicely with encoders, but it's always possible there are bugs. I think those function signatures are slightly more expensive (one extra object allocation), and they're not as Java-friendly, so we probably don't want them to be the default.
That said, I would like to enable that kind of sugar while still taking advantage of all the optimizations going on under the covers. Can you get it to work if you use `as[...]` instead of `map`?

On Wed, Jun 1, 2016 at 11:59 AM, Richard Marscher <rmarsc...@localytics.com> wrote:

> Ah, thanks, I missed seeing the PR for
> https://issues.apache.org/jira/browse/SPARK-15441. If the rows become
> null objects, then I can implement methods that map those back to
> results aligning more closely with the RDD interface.
>
> As a follow-on, I'm curious about thoughts on enriching the Dataset join
> interface versus a package, or users sugaring it for themselves. I haven't
> considered the implications of what the Dataset optimizations, Tungsten,
> and/or bytecode generation can do now regarding joins, so I may be missing
> a critical benefit there around, say, avoiding Options in favor of nulls.
> If nothing else, Option doesn't have a first-class Encoder or DataType
> yet, and maybe for good reasons.
>
> I did find the RDD join interface elegant, though. In an ideal world, an
> API comparable to the following would be nice:
> https://gist.github.com/rmarsch/3ea78b3a9a8a0e83ce162ed947fcab06
>
> On Wed, Jun 1, 2016 at 1:42 PM, Michael Armbrust <mich...@databricks.com>
> wrote:
>
>> Thanks for the feedback. I think this will address at least some of the
>> problems you are describing: https://github.com/apache/spark/pull/13425
>>
>> On Wed, Jun 1, 2016 at 9:58 AM, Richard Marscher <
>> rmarsc...@localytics.com> wrote:
>>
>>> Hi,
>>>
>>> I've been working on transitioning from RDD to Datasets in our
>>> codebase in anticipation of being able to leverage features of 2.0.
>>>
>>> I'm having a lot of difficulty with the impedance mismatches between
>>> how outer joins worked with RDD versus Dataset. The Dataset joins feel
>>> like a big step backwards, IMO. With RDD, leftOuterJoin would give you
>>> Option types of the results from the right side of the join.
>>> This follows idiomatic Scala, avoiding nulls, and was easy to work
>>> with.
>>>
>>> Now with Dataset there is only joinWith, where you specify the join
>>> type, but it has lost all the semantics of identifying missing data
>>> from outer joins. I could write some enriched methods on Dataset with
>>> an implicit class to abstract the messiness away if Dataset nulled out
>>> all mismatched data from an outer join; however, the problem goes even
>>> further in that the values aren't always null. Integer, for example,
>>> defaults to -1 instead of null. Now it's completely ambiguous which
>>> data in the join was actually there versus populated via this atypical
>>> semantic.
>>>
>>> Are there additional options available to work around this issue? I
>>> can convert to RDD and back to Dataset, but that's less than ideal.
>>>
>>> Thanks,
>>> --
>>> *Richard Marscher*
>>> Senior Software Engineer
>>> Localytics
>>> Localytics.com <http://localytics.com/> | Our Blog
>>> <http://localytics.com/blog> | Twitter <http://twitter.com/localytics>
>>> | Facebook <http://facebook.com/localytics> | LinkedIn
>>> <http://www.linkedin.com/company/1148792?trk=tyah>
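[Editor's note: a minimal sketch of the enrichment discussed in this thread, assuming Spark 2.x. The method name `leftOuterJoinWith` and object `DatasetJoinSyntax` are hypothetical, not part of Spark's API; it relies on SPARK-15441's behavior of returning null for an unmatched right side, and on an implicit Encoder being available for the tuple-with-Option result.]

```scala
import org.apache.spark.sql.{Column, Dataset, Encoder}

object DatasetJoinSyntax {
  // Hypothetical enrichment: an RDD-style leftOuterJoin on Dataset that
  // lifts the right side into an Option so callers never see nulls.
  implicit class RichDataset[L](val left: Dataset[L]) extends AnyVal {
    def leftOuterJoinWith[R](right: Dataset[R], condition: Column)(
        implicit e: Encoder[(L, Option[R])]): Dataset[(L, Option[R])] =
      left
        .joinWith(right, condition, "left_outer")
        // With SPARK-15441, an unmatched right side comes back as null,
        // so Option(_) restores the RDD leftOuterJoin semantics.
        .map { case (l, r) => (l, Option(r)) }
  }
}
```

Per Michael's suggestion, the `map` (and its per-row object allocation) might be avoidable by casting the joined result directly with `as[(L, Option[R])]`, letting the encoder handle the null-to-None conversion, though whether that works depends on Option encoder support in the version at hand.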