That kind of thing is likely fixed in 2.0. If you can get a reproduction working there, it would be very helpful if you could open a JIRA.
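[Editor's note: a minimal 2.0 reproduction of the failure described downthread might look something like the following sketch. The dataset contents, app name, and the `Option`-typed re-encoding are illustrative assumptions, not from the thread; the thread reports a codegen `CompileException` on 1.6.1 for this shape, so this may fail rather than print.]

```scala
import org.apache.spark.sql.SparkSession

object JoinWithOptionRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("joinWith-option-repro")
      .getOrCreate()
    import spark.implicits._

    // Two Dataset[(Int, Int)], as in the thread; aliases let the join
    // condition refer to each side by name.
    val left  = Seq((1, 10), (2, 20)).toDS().as("left")
    val right = Seq((1, 100)).toDS().as("right")

    // left_outer joinWith, then re-typed so the possibly-missing right
    // side is an Option, as attempted with as[] downthread. The second
    // left row has no match, so ideally it would decode as None.
    val joined = left
      .joinWith(right, $"left._1" === $"right._1", "left_outer")
      .as[((Int, Int), Option[(Int, Int)])]

    joined.collect().foreach(println)
    spark.stop()
  }
}
```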
On Mon, Jun 6, 2016 at 7:37 AM, Richard Marscher <rmarsc...@localytics.com> wrote:

> A quick unit test attempt didn't get far replacing map with as[]. I'm only
> working against 1.6.1 at the moment, though; I was going to try 2.0, but I'm
> having a hard time building a working spark-sql jar from source. The only
> ones I've managed to make are intended for the full assembly fat jar.
>
> Example of the error from calling joinWith as left_outer and then
> .as[(Option[T], U)] where T and U are both Int:
>
> [info] newinstance(class scala.Tuple2,decodeusingserializer(input[0,
> StructType(StructField(_1,IntegerType,true),
> StructField(_2,IntegerType,true))],scala.Option,true),decodeusingserializer(input[1,
> StructType(StructField(_1,IntegerType,true),
> StructField(_2,IntegerType,true))],scala.Option,true),false,ObjectType(class
> scala.Tuple2),None)
> [info] :- decodeusingserializer(input[0,
> StructType(StructField(_1,IntegerType,true),
> StructField(_2,IntegerType,true))],scala.Option,true)
> [info] :  +- input[0, StructType(StructField(_1,IntegerType,true),
> StructField(_2,IntegerType,true))]
> [info] +- decodeusingserializer(input[1,
> StructType(StructField(_1,IntegerType,true),
> StructField(_2,IntegerType,true))],scala.Option,true)
> [info]    +- input[1, StructType(StructField(_1,IntegerType,true),
> StructField(_2,IntegerType,true))]
>
> Cause: java.util.concurrent.ExecutionException: java.lang.Exception:
> failed to compile: org.codehaus.commons.compiler.CompileException: File
> 'generated.java', Line 32, Column 60: No applicable constructor/method
> found for actual parameters "org.apache.spark.sql.catalyst.InternalRow";
> candidates are: "public static java.nio.ByteBuffer
> java.nio.ByteBuffer.wrap(byte[])", "public static java.nio.ByteBuffer
> java.nio.ByteBuffer.wrap(byte[], int, int)"
>
> The generated code is passing InternalRow objects into ByteBuffer.wrap,
> which expects byte arrays.
>
> This starts from two Datasets of type Dataset[(Int, Int)] joined with the
> expression $"left._1" === $"right._1". I'll have to spend some time getting
> a better understanding of this analysis phase, but hopefully I can come up
> with something.
>
> On Wed, Jun 1, 2016 at 3:43 PM, Michael Armbrust <mich...@databricks.com> wrote:
>
>> Option should play nicely with encoders, but it's always possible there
>> are bugs. I think those function signatures are slightly more expensive
>> (one extra object allocation), and they're not as Java-friendly, so we
>> probably don't want them to be the default.
>>
>> That said, I would like to enable that kind of sugar while still taking
>> advantage of all the optimizations going on under the covers. Can you get
>> it to work if you use `as[...]` instead of `map`?
>>
>> On Wed, Jun 1, 2016 at 11:59 AM, Richard Marscher <rmarsc...@localytics.com> wrote:
>>
>>> Ah, thanks, I missed seeing the PR for
>>> https://issues.apache.org/jira/browse/SPARK-15441. If the rows become
>>> null objects, then I can implement methods that will map those back to
>>> results that align more closely with the RDD interface.
>>>
>>> As a follow-on, I'm curious about thoughts on enriching the Dataset
>>> join interface itself versus a package, or users adding sugar for
>>> themselves. I haven't considered the implications of the optimizations
>>> that Datasets, Tungsten, and/or bytecode generation can do now regarding
>>> joins, so I may be missing a critical benefit there, say around avoiding
>>> Options in favor of nulls. If nothing else, Option doesn't have a
>>> first-class Encoder or DataType yet, and maybe for good reasons.
>>>
>>> I did find the RDD join interface elegant, though. In an ideal world, an
>>> API comparable to the following would be nice:
>>> https://gist.github.com/rmarsch/3ea78b3a9a8a0e83ce162ed947fcab06
>>>
>>> On Wed, Jun 1, 2016 at 1:42 PM, Michael Armbrust <mich...@databricks.com> wrote:
>>>
>>>> Thanks for the feedback.
>>>> I think this will address at least some of the problems you are
>>>> describing: https://github.com/apache/spark/pull/13425
>>>>
>>>> On Wed, Jun 1, 2016 at 9:58 AM, Richard Marscher <rmarsc...@localytics.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I've been working on transitioning from RDD to Datasets in our
>>>>> codebase in anticipation of being able to leverage features of 2.0.
>>>>>
>>>>> I'm having a lot of difficulty with the impedance mismatches between
>>>>> how outer joins worked with RDD versus Dataset. The Dataset joins feel
>>>>> like a big step backwards, IMO. With RDD, leftOuterJoin would give you
>>>>> Option types for the results from the right side of the join. This
>>>>> follows idiomatic Scala, avoids nulls, and was easy to work with.
>>>>>
>>>>> Now with Dataset there is only joinWith, where you specify the join
>>>>> type, but it has lost the semantics for identifying missing data from
>>>>> outer joins. I could write some enriched methods on Dataset with an
>>>>> implicit class to abstract the messiness away if Dataset nulled out
>>>>> all mismatched data from an outer join. However, the problem goes even
>>>>> further in that the values aren't always null: Integer, for example,
>>>>> defaults to -1 instead of null. Now it's completely ambiguous which
>>>>> data in the join was actually present versus populated via this
>>>>> atypical semantic.
>>>>>
>>>>> Are there additional options available to work around this issue? I
>>>>> can convert to RDD and back to Dataset, but that's less than ideal.
>>>>>
>>>>> Thanks,
>>>>> --
>>>>> *Richard Marscher*
>>>>> Senior Software Engineer
>>>>> Localytics
>>>>> Localytics.com <http://localytics.com/> | Our Blog
>>>>> <http://localytics.com/blog> | Twitter <http://twitter.com/localytics>
>>>>> | Facebook <http://facebook.com/localytics> | LinkedIn
>>>>> <http://www.linkedin.com/company/1148792?trk=tyah>
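[Editor's note: the implicit-class enrichment Richard describes could be sketched roughly as follows. `leftOuterJoinWith` is a hypothetical name, not an existing Dataset method, and the sketch assumes a joinWith that yields null (not -1) for unmatched right-side rows, per SPARK-15441.]

```scala
import org.apache.spark.sql.{Column, Dataset, Encoder}

object DatasetJoinSugar {
  // Hypothetical enrichment restoring RDD-style Option semantics on
  // top of joinWith. Assumes unmatched right-side rows come back as
  // null; the thread notes 1.6.1 sometimes yields -1 instead, which
  // this wrapper cannot distinguish from real data.
  implicit class RichDataset[T](val left: Dataset[T]) extends AnyVal {
    def leftOuterJoinWith[U](right: Dataset[U], condition: Column)(
        implicit enc: Encoder[(T, Option[U])]): Dataset[(T, Option[U])] =
      left.joinWith(right, condition, "left_outer")
        .map { case (l, r) => (l, Option(r)) } // Option(null) == None
  }
}
```

Note this is the map-based approach from the thread, which depends on an Encoder for the Option-typed tuple; Michael's suggestion above was to try `as[...]` instead of `map` so the typed projection stays inside the optimizer.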