wangyum commented on PR #44133:
URL: https://github.com/apache/spark/pull/44133#issuecomment-1913223027
> And to be clear, we can put the cast on either side. We want to put the cast on the non-bucketed column side to save shuffle?

Generally speaking, yes. But we can also put it on the bucketed column side:
```sql
create table t1(id decimal(18, 0)) using parquet CLUSTERED BY (id) INTO 100 buckets;
create table t2(id bigint) using parquet CLUSTERED BY (id) INTO 100 buckets;
select * from t1 join t2 on t1.id = t2.id;
```
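For reference, the plans below can be reproduced with a plain `EXPLAIN`. The partition count of 5 in the first plan suggests `spark.sql.shuffle.partitions` was set to 5 in this session (an assumption, not stated above):

```sql
-- Hypothetical reproduction; assumes spark.sql.shuffle.partitions=5,
-- matching the Exchange partition count in the first plan below.
set spark.sql.shuffle.partitions=5;
explain select * from t1 join t2 on t1.id = t2.id;
```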
The current physical plan:
```
AdaptiveSparkPlan isFinalPlan=false
+- SortMergeJoin [cast(id#10 as decimal(20,0))], [cast(id#11L as decimal(20,0))], Inner
   :- Sort [cast(id#10 as decimal(20,0)) ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(cast(id#10 as decimal(20,0)), 5), ENSURE_REQUIREMENTS, [plan_id=38]
   :     +- Filter isnotnull(id#10)
   :        +- FileScan parquet spark_catalog.default.t1[id#10] Batched: true, Bucketed: false (disabled by query planner)
   +- Sort [cast(id#11L as decimal(20,0)) ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(cast(id#11L as decimal(20,0)), 5), ENSURE_REQUIREMENTS, [plan_id=42]
         +- Filter isnotnull(id#11L)
            +- FileScan parquet spark_catalog.default.t2[id#11L] Batched: true, Bucketed: false (disabled by query planner)
```
Both bucketed scans are disabled here because the join keys are wrapped in casts to the common type decimal(20,0), so both sides must be shuffled.

Unwrap the cast in the right join key, and t2's bucketing is used with no shuffle on that side:
```
AdaptiveSparkPlan isFinalPlan=false
+- SortMergeJoin [cast(id#10 as bigint)], [id#11L], Inner
   :- Sort [cast(id#10 as bigint) ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(cast(id#10 as bigint), 100), ENSURE_REQUIREMENTS, [plan_id=38]
   :     +- Filter isnotnull(id#10)
   :        +- FileScan parquet spark_catalog.default.t1[id#10] Batched: true, Bucketed: false (disabled by query planner)
   +- Sort [id#11L ASC NULLS FIRST], false, 0
      +- Filter isnotnull(id#11L)
         +- FileScan parquet spark_catalog.default.t2[id#11L] Batched: true, Bucketed: true, SelectedBucketsCount: 100 out of 100
```
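Dropping the cast on the bigint side is lossless here: every decimal(18, 0) value fits in a bigint, so `cast(t1.id as bigint)` cannot overflow. A quick sanity check (illustrative only):

```sql
-- decimal(18, 0) tops out at 10^18 - 1, well inside bigint's
-- maximum of 9223372036854775807, so this direction is safe.
select cast(cast(999999999999999999 as decimal(18, 0)) as bigint);
-- 999999999999999999
```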
Unwrap the cast in the left join key instead, and t1's bucketing is used, moving the shuffle to the t2 side:
```
AdaptiveSparkPlan isFinalPlan=false
+- SortMergeJoin [id#10], [cast(id#11L as decimal(18,0))], Inner
   :- Sort [id#10 ASC NULLS FIRST], false, 0
   :  +- Filter isnotnull(id#10)
   :     +- FileScan parquet spark_catalog.default.t1[id#10] Batched: true, Bucketed: true, SelectedBucketsCount: 100 out of 100
   +- Sort [cast(id#11L as decimal(18,0)) ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(cast(id#11L as decimal(18,0)), 100), ENSURE_REQUIREMENTS, [plan_id=38]
         +- Filter isnotnull(id#11L)
            +- FileScan parquet spark_catalog.default.t2[id#11L] Batched: true, Bucketed: false (disabled by query planner)
```
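Roughly the same plan can be forced today by writing the cast by hand in the join condition (a sketch of an equivalent query, not what this PR generates):

```sql
-- Manually casting the bigint key to decimal(18, 0) keeps t1's
-- bucketing usable. Under the default (non-ANSI) overflow behavior,
-- a bigint too large for decimal(18, 0) casts to null and so matches
-- nothing, which it could not have matched anyway in this inner join.
select * from t1 join t2 on t1.id = cast(t2.id as decimal(18, 0));
```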