Tim Gautier created SPARK-15620:
-----------------------------------
Summary: Dataset.map creates a dataset that can't be self-joined
Key: SPARK-15620
URL: https://issues.apache.org/jira/browse/SPARK-15620
Project: Spark
Issue Type: Bug
Affects Versions: 1.6.1
Environment: EC2, Spark-shell
Reporter: Tim Gautier
Given this case class and Dataset:
{code}
case class Test(id: Int)
val test = Seq(
  Test(1),
  Test(2),
  Test(3)
).toDS
{code}
'test' can be joined with itself successfully:
{code}
test.as("t1").joinWith(test.as("t2"), $"t1.id" === $"t2.id").show
{code}
However, mapping 'test' like this:
{code}
val testMapped = test.map(t => t.copy(id = t.id + 1))
{code}
results in a new Dataset that can't be joined to itself:
{code}
testMapped.as("t1").joinWith(testMapped.as("t2"), $"t1.id" === $"t2.id").show
{code}
Yields:
{noformat}
scala> testMapped.as("t1").joinWith(testMapped.as("t2"), $"t1.id" === $"t2.id").show
org.apache.spark.sql.AnalysisException: cannot resolve 't1.id' given input columns: [id];
{noformat}
Mapping to a primitive and self-joining on the resulting 'value' column also throws an error:
{code}
val testMapped2 = test.map(_.id)
testMapped2.as("t1").joinWith(testMapped2.as("t2"), $"t1.value" === $"t2.value").show
{code}
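For anyone who wants to reproduce this outside spark-shell, here is a rough standalone sketch of the two self-joins above. It assumes Spark 1.6.x on the classpath and a local master. The object name SelfJoinRepro and the helper report are made up for illustration, and collect is used instead of show so the outcome can be captured with Try. No expected output is listed, since it depends on the Spark build being tested.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{AnalysisException, SQLContext}
import scala.util.{Failure, Success, Try}

// Defined at top level so Spark can derive an encoder for it.
case class Test(id: Int)

// Hypothetical object name, used only for this sketch.
object SelfJoinRepro {

  // Hypothetical helper: prints whether a self-join succeeded or failed.
  def report(label: String, result: Try[_]): Unit = result match {
    case Success(_) =>
      println(s"$label: self-join succeeded")
    case Failure(e: AnalysisException) =>
      println(s"$label: AnalysisException: ${e.getMessage}")
    case Failure(e) =>
      println(s"$label: unexpected failure: $e")
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local master, Spark 1.6.x style SQLContext setup.
    val conf = new SparkConf().setAppName("SPARK-15620-repro").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val test = Seq(Test(1), Test(2), Test(3)).toDS()
    val testMapped = test.map(t => t.copy(id = t.id + 1))

    // Self-join on the original Dataset (reported above as working).
    report("test", Try {
      test.as("t1").joinWith(test.as("t2"), $"t1.id" === $"t2.id").collect()
    })

    // Self-join on the mapped Dataset (reported above as failing).
    report("testMapped", Try {
      testMapped.as("t1").joinWith(testMapped.as("t2"), $"t1.id" === $"t2.id").collect()
    })

    sc.stop()
  }
}
```

Wrapping each join in Try keeps the first failure from aborting the run, so both cases are reported in one pass.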