holdenk created SPARK-24780:
-------------------------------

             Summary: DataFrame.column_name should take into account DataFrame alias for future joins
                 Key: SPARK-24780
                 URL: https://issues.apache.org/jira/browse/SPARK-24780
             Project: Spark
          Issue Type: Improvement
          Components: PySpark, SQL
    Affects Versions: 2.4.0
            Reporter: holdenk


If we join a DataFrame with another DataFrame that has the same column name in the join condition (e.g. shared lineage on one of the condition columns), the columns returned by attribute access don't carry the DataFrame alias. Even though the join condition could be written with the fully qualified name, both sides resolve to the same column, and the join effectively becomes a cross-join.

For example, this currently works even though both posts_by_sampled_authors and mailing_list_posts_in_reply_to contain in_reply_to and message_id fields.

 
{code:python}
posts_with_replies = posts_by_sampled_authors.join(
    mailing_list_posts_in_reply_to,
    [F.col("mailing_list_posts_in_reply_to.in_reply_to") ==
     F.col("posts_by_sampled_authors.message_id")],
    "inner"){code}
 

But a similarly written expression:
{code:python}
posts_with_replies = posts_by_sampled_authors.join(
    mailing_list_posts_in_reply_to,
    [mailing_list_posts_in_reply_to.in_reply_to ==
     posts_by_sampled_authors.message_id],
    "inner"){code}
will fail.
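
Until this is addressed, aliasing both sides explicitly and qualifying the condition via F.col avoids the ambiguity. A minimal runnable sketch (the table data and alias names here are made up for illustration, not from the original report):

{code:python}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[1]").appName("alias-join").getOrCreate()

posts = spark.createDataFrame(
    [(1, "m1"), (2, "m2")], ["author_id", "message_id"])
replies = spark.createDataFrame(
    [("m1", "m3")], ["in_reply_to", "message_id"])

# Alias each side explicitly, then qualify the condition columns with F.col
# so the analyzer can tell the two message_id columns apart.
left = posts.alias("posts")
right = replies.alias("replies")

joined = left.join(
    right,
    F.col("replies.in_reply_to") == F.col("posts.message_id"),
    "inner")

print(joined.count())  # 1 matching reply, not a cross-join
{code}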

 

We could fix this by changing dataframe.column_name in PySpark so that it returns the fully qualified column reference whenever the DataFrame has an alias.
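
To illustrate the proposed behavior (this is a hypothetical pure-Python sketch, not actual PySpark internals), attribute access could prepend the alias when one is set:

{code:python}
# Hypothetical sketch of the proposed resolution behavior.
# Column and AliasedFrame are illustrative stand-ins, not PySpark classes.

class Column:
    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        # Return a string describing the condition, for demonstration only.
        return f"{self.name} = {other.name}"


class AliasedFrame:
    def __init__(self, columns, alias=None):
        self._columns = set(columns)
        self._alias = alias

    def alias(self, name):
        return AliasedFrame(self._columns, name)

    def __getattr__(self, item):
        if item not in self._columns:
            raise AttributeError(item)
        # Proposed change: qualify the column with the frame's alias if set.
        qualified = f"{self._alias}.{item}" if self._alias else item
        return Column(qualified)


posts = AliasedFrame({"message_id"}, alias="posts_by_sampled_authors")
replies = AliasedFrame({"in_reply_to"}, alias="mailing_list_posts_in_reply_to")

print(replies.in_reply_to == posts.message_id)
# mailing_list_posts_in_reply_to.in_reply_to = posts_by_sampled_authors.message_id
{code}

With qualified references, the second join expression from the description would no longer collapse both sides onto the same unqualified column.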



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
