Nikita Eshkeev created SPARK-43021:
--------------------------------------
Summary: Shuffle happens when Coalesce Buckets should occur
Key: SPARK-43021
URL: https://issues.apache.org/jira/browse/SPARK-43021
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.3.1
Reporter: Nikita Eshkeev
h1. What I did
I ran the following code:
{{from pyspark.sql import SparkSession}}
{{spark = (}}
{{ SparkSession}}
{{ .builder}}
{{ .appName("Bucketing")}}
{{ .master("local[4]")}}
{{ .config("spark.sql.bucketing.coalesceBucketsInJoin.enabled", True)}}
{{ .config("spark.sql.autoBroadcastJoinThreshold", "-1")}}
{{ .getOrCreate()}}
{{)}}
{{df1 = spark.range(0, 100)}}
{{df2 = spark.range(0, 100, 2)}}
{{df1.write.bucketBy(4, "id").mode("overwrite").saveAsTable("t1")}}
{{df2.write.bucketBy(2, "id").mode("overwrite").saveAsTable("t2")}}
{{t1 = spark.table("t1")}}
{{t2 = spark.table("t2")}}
{{t2.join(t1, "id").explain()}}
h1. What happened
The physical plan printed by {{explain()}} contains an Exchange node.
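The presence of the shuffle can also be checked programmatically instead of by reading the plan. The sketch below is only an illustration: the helper name {{has_exchange}} is hypothetical, and it assumes that a shuffle appears in the plan text as the standalone word "Exchange", which is a property of the printed plan and not a stable API.

```python
import re

# Hypothetical helper, not part of Spark. Assumes a shuffle shows up in the
# explain() text as the standalone word "Exchange"; the word boundary keeps
# names such as "BroadcastExchange" from matching.
_EXCHANGE = re.compile(r"\bExchange\b")


def has_exchange(plan_text: str) -> bool:
    """Return True if any line of the plan text names an Exchange operator."""
    return any(_EXCHANGE.search(line) for line in plan_text.splitlines())
```

In the session above, the plan text could be captured by wrapping {{t2.join(t1, "id").explain()}} in {{contextlib.redirect_stdout}}, assuming PySpark prints the plan from the Python side.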
h1. What is expected
The plan should not contain any Exchange/Shuffle nodes. {{t1}} has 4 buckets
and {{t2}} has 2 buckets, so their ratio is 2, which is less than 4 (the value
of {{spark.sql.bucketing.coalesceBucketsInJoin.maxBucketRatio}}). Therefore
[CoalesceBucketsInJoin|https://github.com/apache/spark/blob/c9878a212958bc54be529ef99f5e5d1ddf513ec8/sql/core/src/main/scala/org/apache/spark/sql/execution/bucketing/CoalesceBucketsInJoin.scala]
should be applied and {{t1}} should be read as 2 coalesced buckets.
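The bucket-count arithmetic behind this expectation can be sketched as follows. This reflects my reading of the rule's bucket-count conditions only (different counts, the larger divisible by the smaller, ratio within the limit). The function name {{can_coalesce}} is hypothetical, and the join-type and join-key checks that the rule also performs are not modelled here.

```python
def can_coalesce(num_buckets_a: int, num_buckets_b: int,
                 max_bucket_ratio: int = 4) -> bool:
    """Bucket-count part of the check only (assumed, not Spark's own code)."""
    small, large = sorted((num_buckets_a, num_buckets_b))
    if small == large:
        # Equal bucket counts already line up; there is nothing to coalesce.
        return False
    if large % small != 0:
        # The larger side must split evenly into the smaller side's buckets.
        return False
    # The ratio must not exceed the configured maximum.
    return large // small <= max_bucket_ratio


# t1 has 4 buckets and t2 has 2 buckets, as in the report.
print(can_coalesce(4, 2))  # True
```

Under these assumptions the pair (4, 2) passes, which is why I expect the join to read {{t1}} with coalesced buckets and skip the shuffle.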