That kind of thing is likely fixed in 2.0. If you can get a reproduction working there, it would be very helpful if you could open a JIRA.
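[Editor's note: a minimal 2.0 reproduction of the failure described downthread might look something like the following sketch. The dataset contents, app name, and the `Option`-typed re-encoding are illustrative assumptions, not from the thread; the thread reports a codegen `CompileException` on 1.6.1 for this shape, so this may fail rather than print.]

```scala
import org.apache.spark.sql.SparkSession

object JoinWithOptionRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("joinWith-option-repro")
      .getOrCreate()
    import spark.implicits._

    // Two Dataset[(Int, Int)], as in the thread; aliases let the join
    // condition refer to each side by name.
    val left  = Seq((1, 10), (2, 20)).toDS().as("left")
    val right = Seq((1, 100)).toDS().as("right")

    // left_outer joinWith, then re-typed so the possibly-missing right
    // side is an Option, as attempted with as[] downthread. The second
    // left row has no match, so ideally it would decode as None.
    val joined = left
      .joinWith(right, $"left._1" === $"right._1", "left_outer")
      .as[((Int, Int), Option[(Int, Int)])]

    joined.collect().foreach(println)
    spark.stop()
  }
}
```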
On Mon, Jun 6, 2016 at 7:37 AM, Richard Marscher <rmarsc...@localytics.com> wrote:

> A quick unit test attempt didn't get far replacing map with as[]. I'm only
> working against 1.6.1 at the moment, though; I was going to try 2.0, but I'm
> having a hard time building a working spark-sql jar from source. The only
> ones I've managed to make are intended for the full assembly fat jar.
>
> Example of the error from calling joinWith as left_outer and then
> .as[(Option[T], U)] where T and U are both Int:
>
> [info] newinstance(class scala.Tuple2,decodeusingserializer(input[0,
> StructType(StructField(_1,IntegerType,true),
> StructField(_2,IntegerType,true))],scala.Option,true),decodeusingserializer(input[1,
> StructType(StructField(_1,IntegerType,true),
> StructField(_2,IntegerType,true))],scala.Option,true),false,ObjectType(class
> scala.Tuple2),None)
> [info] :- decodeusingserializer(input[0,
> StructType(StructField(_1,IntegerType,true),
> StructField(_2,IntegerType,true))],scala.Option,true)
> [info] :  +- input[0, StructType(StructField(_1,IntegerType,true),
> StructField(_2,IntegerType,true))]
> [info] +- decodeusingserializer(input[1,
> StructType(StructField(_1,IntegerType,true),
> StructField(_2,IntegerType,true))],scala.Option,true)
> [info]    +- input[1, StructType(StructField(_1,IntegerType,true),
> StructField(_2,IntegerType,true))]
>
> Cause: java.util.concurrent.ExecutionException: java.lang.Exception:
> failed to compile: org.codehaus.commons.compiler.CompileException: File
> 'generated.java', Line 32, Column 60: No applicable constructor/method
> found for actual parameters "org.apache.spark.sql.catalyst.InternalRow";
> candidates are: "public static java.nio.ByteBuffer
> java.nio.ByteBuffer.wrap(byte[])", "public static java.nio.ByteBuffer
> java.nio.ByteBuffer.wrap(byte[], int, int)"
>
> The generated code is passing InternalRow objects into ByteBuffer.wrap,
> which expects byte arrays.
>
> This starts from two Datasets of type Dataset[(Int, Int)] joined with the
> expression $"left._1" === $"right._1". I'll have to spend some time getting
> a better understanding of this analysis phase, but hopefully I can come up
> with something.
>
> On Wed, Jun 1, 2016 at 3:43 PM, Michael Armbrust <mich...@databricks.com> wrote:
>
>> Option should play nicely with encoders, but it's always possible there
>> are bugs. I think those function signatures are slightly more expensive
>> (one extra object allocation), and they're not as Java-friendly, so we
>> probably don't want them to be the default.
>>
>> That said, I would like to enable that kind of sugar while still taking
>> advantage of all the optimizations going on under the covers. Can you get
>> it to work if you use `as[...]` instead of `map`?
>>
>> On Wed, Jun 1, 2016 at 11:59 AM, Richard Marscher <rmarsc...@localytics.com> wrote:
>>
>>> Ah, thanks, I missed seeing the PR for
>>> https://issues.apache.org/jira/browse/SPARK-15441. If the rows become
>>> null objects, then I can implement methods that will map those back to
>>> results that align more closely with the RDD interface.
>>>
>>> As a follow-on, I'm curious about thoughts on enriching the Dataset
>>> join interface itself versus a package, or users adding sugar for
>>> themselves. I haven't considered the implications of the optimizations
>>> that Datasets, Tungsten, and/or bytecode generation can do now regarding
>>> joins, so I may be missing a critical benefit there, say around avoiding
>>> Options in favor of nulls. If nothing else, Option doesn't have a
>>> first-class Encoder or DataType yet, and maybe for good reasons.
>>>
>>> I did find the RDD join interface elegant, though. In an ideal world, an
>>> API comparable to the following would be nice:
>>> https://gist.github.com/rmarsch/3ea78b3a9a8a0e83ce162ed947fcab06
>>>
>>> On Wed, Jun 1, 2016 at 1:42 PM, Michael Armbrust <mich...@databricks.com> wrote:
>>>
>>>> Thanks for the feedback.
>>>> I think this will address at least some of the problems you are
>>>> describing: https://github.com/apache/spark/pull/13425
>>>>
>>>> On Wed, Jun 1, 2016 at 9:58 AM, Richard Marscher <rmarsc...@localytics.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I've been working on transitioning from RDD to Datasets in our
>>>>> codebase in anticipation of being able to leverage features of 2.0.
>>>>>
>>>>> I'm having a lot of difficulty with the impedance mismatches between
>>>>> how outer joins worked with RDD versus Dataset. The Dataset joins feel
>>>>> like a big step backwards, IMO. With RDD, leftOuterJoin would give you
>>>>> Option types for the results from the right side of the join. This
>>>>> follows idiomatic Scala, avoids nulls, and was easy to work with.
>>>>>
>>>>> Now with Dataset there is only joinWith, where you specify the join
>>>>> type, but it has lost the semantics for identifying missing data from
>>>>> outer joins. I could write some enriched methods on Dataset with an
>>>>> implicit class to abstract the messiness away if Dataset nulled out
>>>>> all mismatched data from an outer join. However, the problem goes even
>>>>> further in that the values aren't always null: Integer, for example,
>>>>> defaults to -1 instead of null. Now it's completely ambiguous which
>>>>> data in the join was actually present versus populated via this
>>>>> atypical semantic.
>>>>>
>>>>> Are there additional options available to work around this issue? I
>>>>> can convert to RDD and back to Dataset, but that's less than ideal.
>>>>>
>>>>> Thanks,
>>>>> --
>>>>> *Richard Marscher*
>>>>> Senior Software Engineer
>>>>> Localytics
>>>>> Localytics.com <http://localytics.com/> | Our Blog
>>>>> <http://localytics.com/blog> | Twitter <http://twitter.com/localytics>
>>>>> | Facebook <http://facebook.com/localytics> | LinkedIn
>>>>> <http://www.linkedin.com/company/1148792?trk=tyah>
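[Editor's note: the implicit-class enrichment Richard describes could be sketched roughly as follows. `leftOuterJoinWith` is a hypothetical name, not an existing Dataset method, and the sketch assumes a joinWith that yields null (not -1) for unmatched right-side rows, per SPARK-15441.]

```scala
import org.apache.spark.sql.{Column, Dataset, Encoder}

object DatasetJoinSugar {
  // Hypothetical enrichment restoring RDD-style Option semantics on
  // top of joinWith. Assumes unmatched right-side rows come back as
  // null; the thread notes 1.6.1 sometimes yields -1 instead, which
  // this wrapper cannot distinguish from real data.
  implicit class RichDataset[T](val left: Dataset[T]) extends AnyVal {
    def leftOuterJoinWith[U](right: Dataset[U], condition: Column)(
        implicit enc: Encoder[(T, Option[U])]): Dataset[(T, Option[U])] =
      left.joinWith(right, condition, "left_outer")
        .map { case (l, r) => (l, Option(r)) } // Option(null) == None
  }
}
```

Note this is the map-based approach from the thread, which depends on an Encoder for the Option-typed tuple; Michael's suggestion above was to try `as[...]` instead of `map` so the typed projection stays inside the optimizer.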