Re: [PR] [SPARK-46219][SQL] Unwrap cast in join predicates [spark]

via GitHub Sat, 27 Jan 2024 07:56:52 -0800


wangyum commented on PR #44133:
URL: https://github.com/apache/spark/pull/44133#issuecomment-1913205769


   > @wangyum good point. We need to be more careful when optimizing bucketed 
joins. I don't quite understand why we can't optimize it in the logical phase, 
because we don't know which one is the bucket column?
   
   We can add the bucket into to attributes but it won't solve all cases:
   <img width="1372" alt="image" 
src="https://github.com/apache/spark/assets/5399861/7f395849-d2e2-473b-8090-57f058680c93";>
   
   1. We also need to add bucket info to alias:
   ```sql
      create table t1(id int) using parquet CLUSTERED BY (id) INTO 100 buckets;
      create table t2(id bigint) using parquet;
      select * from (select id as id1 from t1) tmp1 join (select id as id2 from 
t2) tmp2 on tmp1.id1 = tmp2.id2;
   ```
   
   2. It is possible that the partition will be on a non-bucket column later:
   ```sql
      create table t1(id bigint, id2 bigint) using parquet;
      create table t2(id int, id2 int) using parquet CLUSTERED BY (id) INTO 100 
buckets;
      select * from (SELECT *, row_number() OVER (PARTITION BY id2 ORDER BY id) 
AS rn FROM t1) tmp1 join (SELECT *, row_number() OVER (PARTITION BY id2 ORDER 
BY id) AS rn FROM t2) tmp2 on tmp1.id = tmp2.id;
   ```
   
   3. It is better to use child output partitioning. Because we can also 
optimize non-bucketed column:
      ```sql
      create table t1(id int) using parquet;
      create table t2(id bigint) using parquet;
      SELECT * FROM (SELECT *, row_number() OVER (PARTITION BY id ORDER BY id) 
AS rn FROM t1) t JOIN t2 ON t.id = t2.id WHERE rn = 1;
      ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [SPARK-46219][SQL] Unwrap cast in join predicates [spark]

Reply via email to