[
https://issues.apache.org/jira/browse/SPARK-43513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17739403#comment-17739403
]
Wenxin Zhou commented on SPARK-43513:
-------------------------------------
Some observations:
1. This is not unique to the 3.4.0 release. I see the same issue when running
the same procedure on PySpark 3.3.
2. This issue may not be unique to the "withColumnRenamed" method. More broadly,
PySpark allows creating a DataFrame with duplicate column names, but columns
can only be selected by name (PySpark has no positional "iloc"-style selection
like the pandas library). As a result, once a DataFrame has duplicate column
names, selecting by the duplicated name will always raise an exception.
For example:
{code:python}
>>> df3 = spark.createDataFrame(
...     [('Monday', 25, 27, 29, 30), ('Tuesday', 40, 38, 36, 34),
...      ('Wednesday', 18, 20, 22, 17), ('Thursday', 25, 27, 29, 19)],
...     ['day', 'temperature', 'temperature', 'temperature', 'temperature'])
>>> df3.show(2)
+-------+-----------+-----------+-----------+-----------+
| day|temperature|temperature|temperature|temperature|
+-------+-----------+-----------+-----------+-----------+
| Monday| 25| 27| 29| 30|
|Tuesday| 40| 38| 36| 34|
+-------+-----------+-----------+-----------+-----------+
>>> df3.select('temperature')
  File "site-packages/pyspark/sql/utils.py", line 196, in deco
    raise converted from None
pyspark.sql.utils.AnalysisException: Reference 'temperature' is ambiguous,
could be: temperature, temperature, temperature, temperature.
{code}
One known workaround is to rename the columns with the DataFrame.toDF method,
as illustrated here:
https://www.geeksforgeeks.org/pyspark-dataframe-distinguish-columns-with-duplicated-name/
> withColumnRenamed duplicates columns if new column already exists
> -----------------------------------------------------------------
>
> Key: SPARK-43513
> URL: https://issues.apache.org/jira/browse/SPARK-43513
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 3.4.0
> Reporter: Frederik Paradis
> Priority: Major
>
> withColumnRenamed should either replace the column when the new column
> already exists, or the documentation should state this behaviour explicitly.
> See the code below as an example of the current state.
> {code:python}
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.master("local[1]").appName("local-spark-session").getOrCreate()
> df = spark.createDataFrame([(1, 0.5, 0.4), (2, 0.5, 0.8)], ["id", "score", "test_score"])
> r = df.withColumnRenamed("test_score", "score")
> print(r) # DataFrame[id: bigint, score: double, score: double]
> # pyspark.sql.utils.AnalysisException: Reference 'score' is ambiguous,
> # could be: score, score.
> print(r.select("score"))
> {code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)