[
https://issues.apache.org/jira/browse/SPARK-30218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dongjoon Hyun reassigned SPARK-30218:
-------------------------------------
Assignee: Dongjoon Hyun
> Columns used in inequality conditions for joins not resolved correctly in
> case of common lineage
> ------------------------------------------------------------------------------------------------
>
> Key: SPARK-30218
> URL: https://issues.apache.org/jira/browse/SPARK-30218
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.3.4, 2.4.4
> Reporter: Francesco Cavrini
> Assignee: Dongjoon Hyun
> Priority: Major
> Labels: correctness
>
> When columns from different data-frames that have a common lineage are used
> in inequality conditions in joins, they are not resolved correctly. In
> particular, both the column from the left DF and the one from the right DF
> are resolved to the same column, thus making the inequality condition either
> always satisfied or always not-satisfied.
> Minimal example to reproduce follows.
> {code:python}
> import pyspark.sql.functions as F
> data = spark.createDataFrame([["id1", "A", 0], ["id1", "A", 1], ["id2", "A",
> 2], ["id2", "A", 3], ["id1", "B", 1] , ["id1", "B", 5], ["id2", "B", 10]],
> ["id", "kind", "timestamp"])
> df_left = data.where(F.col("kind") == "A").alias("left")
> df_right = data.where(F.col("kind") == "B").alias("right")
> conds = [df_left["id"] == df_right["id"]]
> conds.append(df_right["timestamp"].between(df_left["timestamp"],
> df_left["timestamp"] + 2))
> res = df_left.join(df_right, conds, how="left")
> {code}
> The result is:
> | id|kind|timestamp| id|kind|timestamp|
> |id1| A| 0|id1| B| 1|
> |id1| A| 0|id1| B| 5|
> |id1| A| 1|id1| B| 1|
> |id1| A| 1|id1| B| 5|
> |id2| A| 2|id2| B| 10|
> |id2| A| 3|id2| B| 10|
> which violates the condition that the timestamp from the right DF should be
> between df_left["timestamp"] and df_left["timestamp"] + 2.
> The plan shows the problem in the column resolution.
> {code:bash}
> == Parsed Logical Plan ==
> Join LeftOuter, ((id#0 = id#36) && ((timestamp#2L >= timestamp#2L) &&
> (timestamp#2L <= (timestamp#2L + cast(2 as bigint)))))
> :- SubqueryAlias `left`
> : +- Filter (kind#1 = A)
> : +- LogicalRDD [id#0, kind#1, timestamp#2L], false
> +- SubqueryAlias `right`
> +- Filter (kind#37 = B)
> +- LogicalRDD [id#36, kind#37, timestamp#38L], false
> {code}
> Note, the columns used in the equality condition of the join have been
> correctly resolved.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]