[
https://issues.apache.org/jira/browse/SPARK-43021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17709448#comment-17709448
]
zzzzming95 commented on SPARK-43021:
------------------------------------
PR: https://github.com/apache/spark/pull/40688
> Shuffle happens when Coalesce Buckets should occur
> --------------------------------------------------
>
> Key: SPARK-43021
> URL: https://issues.apache.org/jira/browse/SPARK-43021
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.3.1
> Reporter: Nikita Eshkeev
> Priority: Minor
>
> h1. What I did
> I ran the following code:
> {code:python}
> from pyspark.sql import SparkSession
>
> spark = (
>     SparkSession
>     .builder
>     .appName("Bucketing")
>     .master("local[4]")
>     .config("spark.sql.bucketing.coalesceBucketsInJoin.enabled", True)
>     .config("spark.sql.autoBroadcastJoinThreshold", "-1")
>     .getOrCreate()
> )
>
> df1 = spark.range(0, 100)
> df2 = spark.range(0, 100, 2)
>
> df1.write.bucketBy(4, "id").mode("overwrite").saveAsTable("t1")
> df2.write.bucketBy(2, "id").mode("overwrite").saveAsTable("t2")
>
> t1 = spark.table("t1")
> t2 = spark.table("t2")
>
> t2.join(t1, "id").explain()
> {code}
> h1. What happened
> The physical plan for the join contains an Exchange node.
> h1. What is expected
> The plan should not contain any Exchange/Shuffle nodes: {{t1}} has 4 buckets
> and {{t2}} has 2, so their bucket ratio is 2, which is below the limit of 4
> set by {{spark.sql.bucketing.coalesceBucketsInJoin.maxBucketRatio}}. The
> [CoalesceBucketsInJoin|https://github.com/apache/spark/blob/c9878a212958bc54be529ef99f5e5d1ddf513ec8/sql/core/src/main/scala/org/apache/spark/sql/execution/bucketing/CoalesceBucketsInJoin.scala]
> rule should therefore be applied.
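
The bucket-ratio reasoning in the description can be sketched as a small check. This is a simplified illustration only, not Spark's implementation: the function name {{can_coalesce_buckets}} is hypothetical, the divisibility and "unequal bucket counts" conditions are assumptions about how the rule behaves, and the real rule also depends on the join type and join keys. The default of 4 for the maximum ratio is the value quoted in the description.

```python
def can_coalesce_buckets(num_buckets_a: int,
                         num_buckets_b: int,
                         max_bucket_ratio: int = 4) -> bool:
    """Hypothetical helper: decide whether two bucket counts look eligible
    for coalescing, based only on the bucket-ratio argument in the issue.

    Assumptions (not confirmed by the issue text):
      * the larger count must be a multiple of the smaller one;
      * equal counts need no coalescing, so they return False.
    """
    if num_buckets_a <= 0 or num_buckets_b <= 0:
        raise ValueError("bucket counts must be positive")

    small, large = sorted((num_buckets_a, num_buckets_b))
    return (
        small != large
        and large % small == 0
        and large // small <= max_bucket_ratio
    )


# The case from the issue: t1 has 4 buckets, t2 has 2, ratio 2 <= 4.
print(can_coalesce_buckets(4, 2))
```

Under these assumptions the tables in the report qualify, which matches the expectation that the join should be planned without an Exchange.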