wangyum commented on PR #44133: URL: https://github.com/apache/spark/pull/44133#issuecomment-1913205769
> @wangyum good point. We need to be more careful when optimizing bucketed joins. I don't quite understand why we can't optimize it in the logical phase, because we don't know which one is the bucket column? We can add the bucket into to attributes but it won't solve all cases: <img width="1372" alt="image" src="https://github.com/apache/spark/assets/5399861/7f395849-d2e2-473b-8090-57f058680c93"> 1. We also need to add bucket info to alias: ```sql create table t1(id int) using parquet CLUSTERED BY (id) INTO 100 buckets; create table t2(id bigint) using parquet; select * from (select id as id1 from t1) tmp1 join (select id as id2 from t2) tmp2 on tmp1.id1 = tmp2.id2; ``` 2. It is possible that the partition will be on a non-bucket column later: ```sql create table t1(id bigint, id2 bigint) using parquet; create table t2(id int, id2 int) using parquet CLUSTERED BY (id) INTO 100 buckets; select * from (SELECT *, row_number() OVER (PARTITION BY id2 ORDER BY id) AS rn FROM t1) tmp1 join (SELECT *, row_number() OVER (PARTITION BY id2 ORDER BY id) AS rn FROM t2) tmp2 on tmp1.id = tmp2.id; ``` 3. It is better to use child output partitioning. Because we can also optimize non-bucketed column: ```sql create table t1(id int) using parquet; create table t2(id bigint) using parquet; SELECT * FROM (SELECT *, row_number() OVER (PARTITION BY id ORDER BY id) AS rn FROM t1) t JOIN t2 ON t.id = t2.id WHERE rn = 1; ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
