jorgecarleitao edited a comment on pull request #8727:
URL: https://github.com/apache/arrow/pull/8727#issuecomment-731764988


   AFAIK pyspark does not disambiguate:
   
   ```python
   import pyspark
   
   with pyspark.SparkContext() as sc:
       spark = pyspark.sql.SQLContext(sc)
   
       df = spark.createDataFrame([
           [1, 2],
           [2, 3],
       ], schema=["id", "id1"])
   
       df1 = spark.createDataFrame([
           [1, 2],
           [1, 3],
       ], schema=["id", "id1"])
   
       df.join(df1, on="id").show()
   ```
   
   yields 
   
   ```
   +---+---+---+
   | id|id1|id1|
   +---+---+---+
   |  1|  2|  2|
   |  1|  2|  3|
   +---+---+---+
   ```
   
   on `pyspark==2.4.6`
   
   In pyspark, writing `df.join(df1, on="id").select("id1")` errors because the select cannot tell which of the two `id1` columns is meant. IMO this is poor design: the join itself does not fail, but operating on the resulting table does.
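
   For reference, on `pyspark==2.4.6` the failure and an explicit workaround look roughly like this (a sketch continuing the snippet above; the exact exception message may vary):

   ```python
    # The ambiguous reference fails at query analysis time:
    try:
        df.join(df1, on="id").select("id1").show()
    except Exception as e:  # pyspark raises an AnalysisException here
        print(e)  # e.g. "Reference 'id1' is ambiguous ..."

    # Selecting through the originating DataFrame's column object works,
    # because that column is bound to one side of the join:
    df.join(df1, on="id").select(df["id1"]).show()
   ```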
   
   I am generally against disambiguation because it changes the schema only when columns collide (or would we always add some `left_` prefix?). In general, colliding columns require the user to disambiguate them anyway, either before the statement (via an alias) or after the statement (via `?.column_name`), as sketched below. Raising an error is IMO the best possible outcome, as it forces the user to be explicit about what they want.
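
   For instance, a minimal sketch of the alias route in pyspark (the alias names `l` and `r` are arbitrary):

   ```python
    # Disambiguate before the statement by aliasing each side,
    # then qualify the colliding column names explicitly:
    joined = df.alias("l").join(df1.alias("r"), on="id")
    joined.select("l.id1", "r.id1").show()
   ```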

