Enrico Minack created SPARK-42132:
-------------------------------------
Summary: DeduplicateRelations rule breaks plan when co-grouping
the same DataFrame
Key: SPARK-42132
URL: https://issues.apache.org/jira/browse/SPARK-42132
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.2.3, 3.3.1, 3.3.0, 3.1.3, 3.0.3, 3.4.0
Reporter: Enrico Minack
Co-grouping two DataFrames that share attribute references yields a plan
that the DeduplicateRelations rule breaks:
{code:java}
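// Encoders for as[Long, Long]; spark-shell imports these automatically.
import spark.implicits._

val df = spark.range(3)
// Both grouped datasets derive from the same df, so both reference id#0L.
val left_grouped_df = df.groupBy("id").as[Long, Long]
val right_grouped_df = df.groupBy("id").as[Long, Long]
val cogroup_df = left_grouped_df.cogroup(right_grouped_df) {
  case (key, left, right) => left
}
cogroup_df.explain()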
{code}
{code:java}
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- SerializeFromObject [input[0, bigint, false] AS value#12L]
   +- CoGroup, id#0: bigint, id#0: bigint, id#0: bigint, [id#13L], [id#13L], [id#13L], [id#13L], obj#11: bigint
      :- !Sort [id#13L ASC NULLS FIRST], false, 0
      :  +- !Exchange hashpartitioning(id#13L, 200), ENSURE_REQUIREMENTS, [plan_id=16]
      :     +- Range (0, 3, step=1, splits=16)
      +- Sort [id#13L ASC NULLS FIRST], false, 0
         +- Exchange hashpartitioning(id#13L, 200), ENSURE_REQUIREMENTS, [plan_id=17]
            +- Range (0, 3, step=1, splits=16)
{code}
The DataFrame cannot be computed:
{code:java}
cogroup_df.show()
{code}
{code:java}
java.lang.IllegalStateException: Couldn't find id#13L in [id#0L]
{code}
The rule replaces `id#0L` on the right side with `id#13L`, but in doing so it
replaces every occurrence of `id#0L` in `CoGroup`. Some of those occurrences
refer to the left side and should not be replaced. Further, `id#0L` in the
right deserializer is not replaced.
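
The mismatch can be illustrated with a small toy model. This is only a sketch of the behaviour described above, not Spark's Catalyst code: `Attr`, `CoGroupModel`, `DedupSketch` and all of its members are hypothetical names, and the "side-aware" rewrite merely shows what this report expects, not an actual fix.

```scala
// Toy model only: these types are hypothetical and are NOT Spark's Catalyst classes.
final case class Attr(name: String, exprId: Int)

// Simplified co-group node: each field holds the attribute it references.
final case class CoGroupModel(
    leftGroup: Attr,
    leftAttr: Attr,
    leftDeserializer: Attr,
    rightGroup: Attr,
    rightAttr: Attr,
    rightDeserializer: Attr)

object DedupSketch {
  val id0: Attr = Attr("id", 0)   // output of the left child
  val id13: Attr = Attr("id", 13) // fresh output assigned to the right child

  // Both sides start out referencing id#0 because they share one DataFrame.
  val original: CoGroupModel = CoGroupModel(id0, id0, id0, id0, id0, id0)

  private def remap(a: Attr): Attr = if (a == id0) id13 else a

  // Rewrite as described in this report: grouping and input attributes
  // change on BOTH sides, while the deserializers are left untouched.
  def naive(n: CoGroupModel): CoGroupModel =
    n.copy(
      leftGroup = remap(n.leftGroup),
      leftAttr = remap(n.leftAttr),
      rightGroup = remap(n.rightGroup),
      rightAttr = remap(n.rightAttr))

  // Expected rewrite: only right-side fields change, deserializer included.
  def sideAware(n: CoGroupModel): CoGroupModel =
    n.copy(
      rightGroup = remap(n.rightGroup),
      rightAttr = remap(n.rightAttr),
      rightDeserializer = remap(n.rightDeserializer))

  // Names of fields that reference an attribute their own child does not produce.
  def dangling(n: CoGroupModel): List[String] =
    List(
      "leftGroup" -> (n.leftGroup == id0),
      "leftAttr" -> (n.leftAttr == id0),
      "leftDeserializer" -> (n.leftDeserializer == id0),
      "rightGroup" -> (n.rightGroup == id13),
      "rightAttr" -> (n.rightAttr == id13),
      "rightDeserializer" -> (n.rightDeserializer == id13)
    ).collect { case (field, ok) if !ok => field }
}
```

In this model, `DedupSketch.dangling(DedupSketch.naive(DedupSketch.original))` lists the left grouping and input attributes plus the right deserializer, which matches the three problems named above, while the side-aware rewrite leaves no dangling field.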