Tim Gautier created SPARK-15620:
-----------------------------------

             Summary: Dataset.map creates a dataset that can't be self-joined
                 Key: SPARK-15620
                 URL: https://issues.apache.org/jira/browse/SPARK-15620
             Project: Spark
          Issue Type: Bug
    Affects Versions: 1.6.1
         Environment: EC2, Spark-shell
            Reporter: Tim Gautier


Given this case class and Dataset:
{code}
case class Test(id: Int)
val test = Seq(
  Test(1),
  Test(2),
  Test(3)
).toDS
{code}

'test' can be joined with itself successfully:
{code}
test.as("t1").joinWith(test.as("t2"), $"t1.id" === $"t2.id").show
{code}

However, mapping 'test' like this:
{code}
val testMapped = test.map(t => t.copy(id = t.id + 1))
{code}
results in a new Dataset that can't be joined to itself:
{code}
testMapped.as("t1").joinWith(testMapped.as("t2"), $"t1.id" === $"t2.id").show
{code}
Yields:
{noformat}
scala> testMapped.as("t1").joinWith(testMapped.as("t2"), $"t1.id" === $"t2.id").show
org.apache.spark.sql.AnalysisException: cannot resolve 't1.id' given input columns: [id];
{noformat}

This also throws an error:
{code}
val testMapped2 = test.map(_.id)
testMapped2.as("t1").joinWith(testMapped2.as("t2"), $"t1.value" === $"t2.value").show
{code}
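
For anyone who wants to reproduce this outside of spark-shell, the snippets above can be combined into a standalone application. This is only a sketch against the Spark 1.6 API: the object name SelfJoinRepro, the app name, and the local[*] master are assumptions made for illustration and are not part of the original EC2 setup. The explicit SQLContext and implicits import stand in for what spark-shell provides automatically.
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{AnalysisException, SQLContext}

// Declared at the top level so that Spark can derive an encoder for it.
case class Test(id: Int)

// Hypothetical wrapper object, used only to make the repro self-contained.
object SelfJoinRepro {
  def main(args: Array[String]): Unit = {
    // Assumption: local mode; the original report used spark-shell on EC2.
    val conf = new SparkConf().setAppName("SelfJoinRepro").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._ // provides toDS and the $"..." column syntax

    val test = Seq(Test(1), Test(2), Test(3)).toDS()
    val testMapped = test.map(t => t.copy(id = t.id + 1))

    // Self-join on the original Dataset (reported to succeed).
    test.as("t1").joinWith(test.as("t2"), $"t1.id" === $"t2.id").show()

    // Self-join on the mapped Dataset (reported to fail on 1.6.1).
    try {
      testMapped.as("t1").joinWith(testMapped.as("t2"), $"t1.id" === $"t2.id").show()
    } catch {
      case e: AnalysisException =>
        println(s"Self-join on mapped Dataset failed: ${e.getMessage}")
    }

    sc.stop()
  }
}
{code}
The AnalysisException is caught only so that the application shuts down cleanly after printing the message. Remove the setMaster call if the master is supplied through spark-submit.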


