Re: Dataset Outer Join vs RDD Outer Join

2016-06-07 Thread Richard Marscher
For anyone following along: the chain went private for a bit, but there were still issues with the bytecode generation in the 2.0-preview, so this JIRA was created: https://issues.apache.org/jira/browse/SPARK-15786 On Mon, Jun 6, 2016 at 1:11 PM, Michael Armbrust wrote: >

Re: Dataset Outer Join vs RDD Outer Join

2016-06-06 Thread Michael Armbrust
That kind of stuff is likely fixed in 2.0. If you can get a reproduction working there it would be very helpful if you could open a JIRA. On Mon, Jun 6, 2016 at 7:37 AM, Richard Marscher wrote: > A quick unit test attempt didn't get far replacing map with as[], I'm

Re: Dataset Outer Join vs RDD Outer Join

2016-06-06 Thread Richard Marscher
A quick unit test attempt didn't get far replacing map with as[]. I'm only working against 1.6.1 at the moment, though. I was going to try 2.0, but I'm having a hard time building a working spark-sql jar from source; the only ones I've managed to make are intended for the full assembly fat jar.

Re: Dataset Outer Join vs RDD Outer Join

2016-06-01 Thread Michael Armbrust
Option should play nicely with encoders, but it's always possible there are bugs. I think those function signatures are slightly more expensive (one extra object allocation) and not as Java-friendly, so we probably don't want them to be the default. That said, I would like to enable that kind

Re: Dataset Outer Join vs RDD Outer Join

2016-06-01 Thread Richard Marscher
Ah thanks, I missed seeing the PR for https://issues.apache.org/jira/browse/SPARK-15441. If the rows became null objects then I can implement methods that will map those back to results that align more closely with the RDD interface. As a follow-on, I'm curious about thoughts regarding enriching the
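
The idea of mapping null rows back to an RDD-like shape can be sketched in plain Scala, without Spark. This is only an illustration under stated assumptions: the `User` and `Order` types and the `rightToOption` helper are hypothetical names that do not appear in the thread, and an ordinary `Seq` of pairs stands in for the result of a left outer join whose unmatched right side is null.

```scala
// Hypothetical record types, for illustration only (not from the thread).
final case class User(id: Int, name: String)
final case class Order(userId: Int, total: Double)

object OuterJoinShape {
  // Wrap the possibly-null right side of each pair in an Option.
  // Option(x) yields None when x is null and Some(x) otherwise, which gives
  // the (L, Option[R]) shape that RDD leftOuterJoin uses for its values.
  def rightToOption[L, R](pairs: Seq[(L, R)]): Seq[(L, Option[R])] =
    pairs.map { case (l, r) => (l, Option(r)) }
}
```

For example, `OuterJoinShape.rightToOption(Seq[(User, Order)]((User(2, "b"), null)))` pairs `User(2, "b")` with `None`. Whether the same mapping works inside a Dataset depends on encoder support for Option, which is what the later messages in this thread discuss.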

Re: Dataset Outer Join vs RDD Outer Join

2016-06-01 Thread Michael Armbrust
Thanks for the feedback. I think this will address at least some of the problems you are describing: https://github.com/apache/spark/pull/13425 On Wed, Jun 1, 2016 at 9:58 AM, Richard Marscher wrote: > Hi, > > I've been working on transitioning from RDD to Datasets in

Dataset Outer Join vs RDD Outer Join

2016-06-01 Thread Richard Marscher
Hi, I've been working on transitioning from RDD to Datasets in our codebase in anticipation of being able to leverage features of 2.0. I'm having a lot of difficulties with the impedance mismatches between how outer joins worked with RDD versus Dataset. The Dataset joins feel like a big step