[
https://issues.apache.org/jira/browse/SPARK-43513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17739403#comment-17739403
]
Wenxin Zhou commented on SPARK-43513:
-------------------------------------
Some observations:
1. This is not unique to the 3.4.0 release. I see the same issue when running
the same procedure on PySpark 3.3.
2. This issue may not be unique to the "withColumnRenamed" method. More broadly,
PySpark allows creating a DataFrame with duplicate column names, but columns
can only be selected by name (PySpark has no positional "iloc"-style selection
like the pandas library). As a result, once a DataFrame has duplicate column
names, selecting by the duplicated name will always raise an exception.
For example:
{code:python}
>>> df3 = spark.createDataFrame(
...     [('Monday', 25, 27, 29, 30), ('Tuesday', 40, 38, 36, 34),
...      ('Wednesday', 18, 20, 22, 17), ('Thursday', 25, 27, 29, 19)],
...     ['day', 'temperature', 'temperature', 'temperature', 'temperature'])
>>> df3.show(2)
+-------+-----------+-----------+-----------+-----------+
| day|temperature|temperature|temperature|temperature|
+-------+-----------+-----------+-----------+-----------+
| Monday| 25| 27| 29| 30|
|Tuesday| 40| 38| 36| 34|
+-------+-----------+-----------+-----------+-----------+
>>> df3.select('temperature')
  File "site-packages/pyspark/sql/utils.py", line 196, in deco
    raise converted from None
pyspark.sql.utils.AnalysisException: Reference 'temperature' is ambiguous,
could be: temperature, temperature, temperature, temperature.
{code}
One known workaround is to rename the columns with the DataFrame.toDF method,
as illustrated here:
https://www.geeksforgeeks.org/pyspark-dataframe-distinguish-columns-with-duplicated-name/
> withColumnRenamed duplicates columns if new column already exists
> -----------------------------------------------------------------
>
> Key: SPARK-43513
> URL: https://issues.apache.org/jira/browse/SPARK-43513
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 3.4.0
> Reporter: Frederik Paradis
> Priority: Major
>
> withColumnRenamed should either replace the column when the new column
> already exists, or the documentation should state this behaviour explicitly.
> See the code below as an example of the current state.
> {code:python}
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.master("local[1]").appName("local-spark-session").getOrCreate()
> df = spark.createDataFrame([(1, 0.5, 0.4), (2, 0.5, 0.8)], ["id", "score", "test_score"])
> r = df.withColumnRenamed("test_score", "score")
> print(r) # DataFrame[id: bigint, score: double, score: double]
> # pyspark.sql.utils.AnalysisException: Reference 'score' is ambiguous,
> # could be: score, score.
> print(r.select("score"))
> {code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)