jorgecarleitao edited a comment on pull request #8727:
URL: https://github.com/apache/arrow/pull/8727#issuecomment-731764988
AFAIK pyspark does not disambiguate:
```python
import pyspark

with pyspark.SparkContext() as sc:
    spark = pyspark.sql.SQLContext(sc)
    # Both frames share the non-join column name "id1".
    df = spark.createDataFrame([
        [1, 2],
        [2, 3],
    ], schema=["id", "id1"])
    df1 = spark.createDataFrame([
        [1, 2],
        [1, 3],
    ], schema=["id", "id1"])
    # Joining on "id" keeps both "id1" columns without renaming either.
    df.join(df1, on="id").show()
```
yields
```
+---+---+---+
| id|id1|id1|
+---+---+---+
| 1| 2| 2|
| 1| 2| 3|
+---+---+---+
```
on `pyspark==2.4.6`
In pyspark, writing `df.join(df1, on="id").select("id1")` errors because the
select cannot tell which column is meant. IMO this is poor judgment: the join
itself succeeds, but operating on the resulting table crashes.
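To illustrate (continuing the session above; the exact exception message varies by pyspark version):
```python
# Selecting the colliding column after the join is ambiguous and raises
# pyspark.sql.utils.AnalysisException, along the lines of:
#   "Reference 'id1' is ambiguous, could be: id1, id1."
df.join(df1, on="id").select("id1").show()
```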
I am generally against automatic disambiguation because it changes the schema
only when columns collide (or would we always add some `left_` prefix?). In
general, colliding columns require the user to disambiguate them explicitly,
either before the statement (via an alias) or after the statement (via
`?.column_name`). Raising an error is IMO the best possible outcome, as it
requires the user to be explicit about what they want.
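For comparison, explicit disambiguation via aliases is already available today; a minimal sketch using the same `df`/`df1` as above, where column references are qualified by the alias name:
```python
# Alias each side before the join so each "id1" can be referenced unambiguously.
a = df.alias("a")
b = df1.alias("b")
a.join(b, on="id").select("a.id1", "b.id1").show()
```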