Github user yucai commented on the issue:
https://github.com/apache/spark/pull/21156
A classic scenario could be like below:
```
SELECT
  ...
FROM
  lstg_item item,
  lstg_item_vrtn v
WHERE
  item.auct_end_dt = CAST(SUBSTR('2018-04-19 00:00:00', 1, 10) AS DATE)
  AND item.item_id = v.item_id
  AND item.auct_end_dt = v.auct_end_dt;
```
`lstg_item` is a very big table and `item_id` is its primary key.
If we bucket it on `item_id`:
- No data skew: because `item_id` is unique, each bucket holds roughly the same amount of data.
- Before this PR, the query above needs an extra shuffle on the big table; after this PR, we can save that shuffle.
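For illustration, the bucketed layout could be declared roughly as below. The schemas, storage format, and bucket count are assumptions for the sketch, not taken from the PR; only the table names, `item_id`, and `auct_end_dt` come from the query above:

```
-- Hypothetical Spark SQL DDL: bucket both sides on the join key item_id
-- so a sort-merge join can read matching buckets without a shuffle.
CREATE TABLE lstg_item (
  item_id     BIGINT,   -- primary key, unique per row
  auct_end_dt DATE
  -- ... other columns elided
)
USING PARQUET
CLUSTERED BY (item_id) INTO 256 BUCKETS;

CREATE TABLE lstg_item_vrtn (
  item_id     BIGINT,
  auct_end_dt DATE
  -- ... other columns elided
)
USING PARQUET
CLUSTERED BY (item_id) INTO 256 BUCKETS;
```

With both tables bucketed identically on `item_id`, rows with the same `item_id` land in the same bucket number on each side, which is what lets the join above avoid re-shuffling the big table.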