Re: I'm pretty sure this is a Dataset bug

2016-05-27 Thread Tim Gautier
I'm working around it like this:

  val testMapped2 = test1.rdd.map(t => t.copy(id = t.id + 1)).toDF.as[Test]
  testMapped2.as("t1").joinWith(testMapped2.as("t2"), $"t1.id" === $"t2.id").show

Switching to an RDD, mapping, then going back to a Dataset seemed to avoid the issue. On Fri, May 27, 2016 at
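A fuller sketch of this workaround, with the setup the preview omits (the `Test` case class and sample data are taken from the later messages in this thread; the implicits import is assumed):

  // Hedged sketch of the RDD round-trip workaround described above.
  // Assumes a SQLContext/SparkSession with implicits in scope.
  import sqlContext.implicits._

  case class Test(id: Int)
  val test1 = Seq(Test(1), Test(2), Test(3)).toDS

  // Round-tripping through an RDD rebuilds the Dataset from scratch,
  // which sidesteps the self-join resolution problem after map:
  val testMapped2 = test1.rdd.map(t => t.copy(id = t.id + 1)).toDF.as[Test]

  testMapped2.as("t1")
    .joinWith(testMapped2.as("t2"), $"t1.id" === $"t2.id")
    .show()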

Re: I'm pretty sure this is a Dataset bug

2016-05-27 Thread Koert Kuipers
i am glad to see this, i think we ran into this as well (in 2.0.0-SNAPSHOT) but i couldn't reproduce it nicely. my observation was that joins of 2 datasets that were derived from the same datasource gave this kind of trouble. i changed my datasource from val to def (so it got created twice) as a
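Koert's val-to-def change means the source Dataset is rebuilt on every reference, so the two sides of the join no longer share one plan. A hedged sketch of the idea (the path and schema are illustrative, not from the original message):

  // Before: one shared Dataset instance on both sides of the join,
  // which is the shape that triggered the resolution trouble.
  // val source = sqlContext.read.parquet("/some/path").as[Test]

  // After: each reference builds a fresh Dataset (created twice),
  // so the join sees two independent lineages.
  def source = sqlContext.read.parquet("/some/path").as[Test]

  source.as("t1").joinWith(source.as("t2"), $"t1.id" === $"t2.id").show()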

Re: I'm pretty sure this is a Dataset bug

2016-05-27 Thread Ted Yu
I tried the master branch:

  scala> val testMapped = test.map(t => t.copy(id = t.id + 1))
  testMapped: org.apache.spark.sql.Dataset[Test] = [id: int]
  scala> testMapped.as("t1").joinWith(testMapped.as("t2"), $"t1.id" === $"t2.id").show
  org.apache.spark.sql.AnalysisException: cannot resolve '`t1.id`'

Re: I'm pretty sure this is a Dataset bug

2016-05-27 Thread Tim Gautier
Oops, screwed up my example. This is what it should be:

  case class Test(id: Int)
  val test = Seq(Test(1), Test(2), Test(3)).toDS
  test.as("t1").joinWith(test.as("t2"), $"t1.id" === $"t2.id").show
  val testMapped = test.map(t => t.copy(id = t.id + 1))

Re: I'm pretty sure this is a Dataset bug

2016-05-27 Thread Tim Gautier
I figured out the trigger. It turns out it wasn't because I loaded it from the database; it was because the first thing I do after loading is lower-case all the strings. After a Dataset has been mapped, the resulting Dataset can't be self-joined. Here's a test case that illustrates the issue:
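The archive preview cuts off before the test case itself; based on the corrected example posted elsewhere in this thread, a self-contained sketch of the failure (Spark 1.6.1-era behavior, implicits assumed):

  import sqlContext.implicits._

  case class Test(id: Int)
  val test = Seq(Test(1), Test(2), Test(3)).toDS

  // Self-join on the original Dataset works:
  test.as("t1").joinWith(test.as("t2"), $"t1.id" === $"t2.id").show()

  // After a map, the same self-join fails to resolve the alias:
  val testMapped = test.map(t => t.copy(id = t.id + 1))
  testMapped.as("t1").joinWith(testMapped.as("t2"), $"t1.id" === $"t2.id").show()
  // org.apache.spark.sql.AnalysisException: cannot resolve '`t1.id`'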

Re: I'm pretty sure this is a Dataset bug

2016-05-27 Thread Tim Gautier
I stand corrected. I just created a test table with a single int field, and the Dataset loaded from it works with no issues. I'll see if I can track down exactly what the difference might be. On Fri, May 27, 2016 at 10:29 AM Tim Gautier wrote: > I'm using

Re: I'm pretty sure this is a Dataset bug

2016-05-27 Thread Tim Gautier
I'm using 1.6.1. I'm not sure what good fake data would do, since it doesn't seem to have anything to do with the data itself. It has to do with how the Dataset was created. Both Datasets have exactly the same data in them, but the one created from a SQL query fails where the one created from a

Re: I'm pretty sure this is a Dataset bug

2016-05-27 Thread Ted Yu
Which release of Spark are you using? Is it possible to come up with fake data that shows what you described? Thanks On Fri, May 27, 2016 at 8:24 AM, Tim Gautier wrote: > Unfortunately I can't show exactly the data I'm using, but this is what > I'm seeing: > > I have

I'm pretty sure this is a Dataset bug

2016-05-27 Thread Tim Gautier
Unfortunately I can't show exactly the data I'm using, but this is what I'm seeing: I have a case class 'Product' that represents a table in our database. I load that data via sqlContext.read.format("jdbc").options(...).load.as[Product] and register it in a temp table 'product'. For testing, I
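A hedged sketch of the load described above, using the Spark 1.6-era JDBC API. The connection options and the `Product` fields are illustrative placeholders (the original message elides them); registering the temp table through the intermediate DataFrame, as `registerTempTable` expects:

  // Illustrative only: the real table schema and JDBC options are not
  // shown in the thread.
  case class Product(id: Int, name: String)

  val productDF = sqlContext.read
    .format("jdbc")
    .options(Map(
      "url"     -> "jdbc:postgresql://host/db",  // placeholder
      "dbtable" -> "product"                     // placeholder
    ))
    .load()

  productDF.registerTempTable("product")
  val products = productDF.as[Product]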