[
https://issues.apache.org/jira/browse/SPARK-16464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15385255#comment-15385255
]
Liwei Lin edited comment on SPARK-16464 at 7/20/16 3:30 AM:
------------------------------------------------------------
Hi [~shivaram], [~dongjoon], [[email protected]]: in Scala, {{withColumn}}'s
behavior is "adding a column or replacing the existing column that has the same
name" (please refer to
[Dataset.withColumn|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L1708]):
{code}
// Results are the same for Spark 1.6.1 and current master.
// Setup: a Dataset with a single id column holding 1, 2, 3.
val ds0 = sqlContext.range(1, 4)
ds0.show()
/* prints
+---+
| id|
+---+
|  1|
|  2|
|  3|
+---+
*/
// A new name appends a column.
val ds1 = ds0.withColumn("newId", $"id")
ds1.show()
/* prints
+---+-----+
| id|newId|
+---+-----+
|  1|    1|
|  2|    2|
|  3|    3|
+---+-----+
*/
// Reusing the name replaces the existing column; no duplicate is created.
val ds2 = ds1.withColumn("newId", $"id" * 2)
ds2.show()
/* prints
+---+-----+
| id|newId|
+---+-----+
|  1|    2|
|  2|    4|
|  3|    6|
+---+-----+
*/
{code}
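To make the "add or replace" semantics concrete, here is a minimal plain-Scala sketch. It is only a model and not Spark's implementation: {{Col}}, {{WithColumnSketch}} and {{appendColumn}} are hypothetical names introduced for illustration, and the name match here is a simple exact string comparison. The always-append variant is included only for contrast, to show how duplicate names like those described in this issue could arise.

```scala
// Plain-Scala model of "add or replace by name"; this is NOT Spark's code.
object WithColumnSketch {
  // Hypothetical stand-in for a column: a name plus its values.
  final case class Col(name: String, values: Vector[Long])

  // Replace the column whose name matches (keeping its position);
  // otherwise append the new column at the end.
  def withColumn(cols: Vector[Col], name: String, values: Vector[Long]): Vector[Col] = {
    val replacement = Col(name, values)
    if (cols.exists(_.name == name))
      cols.map(c => if (c.name == name) replacement else c)
    else
      cols :+ replacement
  }

  // Contrast case (assumption, for illustration only): always appending
  // yields two columns with the same name.
  def appendColumn(cols: Vector[Col], name: String, values: Vector[Long]): Vector[Col] =
    cols :+ Col(name, values)

  def main(args: Array[String]): Unit = {
    val ds0 = Vector(Col("id", Vector(1L, 2L, 3L)))
    val ds1 = withColumn(ds0, "newId", Vector(1L, 2L, 3L))
    val ds2 = withColumn(ds1, "newId", Vector(2L, 4L, 6L))
    println(ds2.map(_.name).mkString(", "))  // prints: id, newId

    val dup = appendColumn(ds1, "newId", Vector(2L, 4L, 6L))
    println(dup.map(_.name).mkString(", "))  // prints: id, newId, newId
  }
}
```

With the replace-on-match rule the column names stay unique, so a later lookup of {{newId}} by name is unambiguous; with the append-only variant the same lookup would match two columns.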
> withColumn() allows illegal creation of duplicate column names on DataFrame
> ---------------------------------------------------------------------------
>
> Key: SPARK-16464
> URL: https://issues.apache.org/jira/browse/SPARK-16464
> Project: Spark
> Issue Type: Bug
> Components: SparkR, SQL
> Affects Versions: 1.6.1
> Environment: Databricks.com
> Reporter: Neil Dewar
> Priority: Minor
>
> If I take an existing DataFrame, I am permitted to use withColumn() to create
> a duplicate column name. I assume this should be illegal, and withColumn
> should be prevented from permitting this. Some functions subsequently fail
> due to the duplicate column names. Example:
> sdfCar <- createDataFrame(sqlContext, mtcars)
> sdfCar1 <- withColumn(sdfCar, "isEfficient", sdfCar$mpg<=20)
> sdfCar1 <- withColumn(sdfCar1, "isEfficient", ifelse(sdfCar1$mpg == sdfCar1$mpg,1,0))
> sdfCar2 <- subset(sdfCar1, select=sdfCar1$isEfficient)
> # subset() command fails with message: "Reference 'isEfficient' is ambiguous"
> Note: I only know that this affects SparkR; it might also affect other language APIs.