Github user dilipbiswal commented on the issue:
https://github.com/apache/spark/pull/13483
@viirya This is a design decision. So far, neither approach is perfect.
In my mind, we have to consider the use cases here. If users want
duplicate columns, they should give the copies distinct names instead of
reusing the same name. Do you think this makes sense?
That means we should not remove the duplicate in the following scenario:
```
df.groupBy("col1").agg($"col1".as("col1_replica"), count("*"))
The expected output schema:
+----+------------+--------+
|col1|col1_replica|count(1)|
+----+------------+--------+
```
If they do not change the column name, I cannot think of any use case
for duplicating the column.
```
df.groupBy("col1").agg($"col1", count("*"))
df.groupBy("col1").agg(count("*"))
The expected output schema of the above two:
+----+--------+
|col1|count(1)|
+----+--------+
```
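To make the proposed rule concrete, here is a minimal plain-Scala sketch with no Spark dependency. `OutCol` and `outputSchema` are hypothetical names used only for illustration; they are not Spark APIs and not the implementation in this PR. The sketch assumes an aggregate-list column counts as a duplicate only when it reads a grouping column and keeps that column's name.

```scala
object DedupSketch {
  // Hypothetical model of an output column: the column it reads from
  // and the name it exposes in the result schema.
  final case class OutCol(source: String, name: String)

  // Sketch of the proposed rule: drop an aggregate-list column only when it
  // repeats a grouping column under the same name; aliased copies are kept.
  def outputSchema(grouping: Seq[String], aggs: Seq[OutCol]): Seq[String] = {
    val kept = aggs.filterNot(c => grouping.contains(c.name) && c.source == c.name)
    grouping ++ kept.map(_.name)
  }

  def main(args: Array[String]): Unit = {
    val count = OutCol("*", "count(1)")
    // Aliased copy: the duplicate is kept.
    println(outputSchema(Seq("col1"), Seq(OutCol("col1", "col1_replica"), count)))
    // Same name repeated: the duplicate is dropped.
    println(outputSchema(Seq("col1"), Seq(OutCol("col1", "col1"), count)))
    // No explicit copy at all: same schema as the previous case.
    println(outputSchema(Seq("col1"), Seq(count)))
  }
}
```

Under this assumption the last two calls yield the same schema, which matches the expectation above for the two un-aliased queries.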
What is your opinion?