Github user yucai commented on the issue:
https://github.com/apache/spark/pull/21156
A classic scenario could be like below:
```
SELECT
  ...
FROM
  lstg_item item,
  lstg_item_vrtn v
WHERE
  item.auct_end_dt = CAST(SUBSTR('2018-04-19 00:00:00', 1, 10) AS DATE)
  AND item.item_id = v.item_id
  AND item.auct_end_dt = v.auct_end_dt;
```
`lstg_item` is a very big table and `item_id` is its primary key.
If we bucket it on `item_id`:
- No data skew: because `item_id` is unique, each bucket holds roughly the same amount of data.
- Before this PR, the query above needs an extra shuffle on the big table; after this PR, we can save that shuffle.
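For illustration, the bucketed layout could be declared roughly as below. The schemas, storage format, and bucket count are assumptions for the sketch, not taken from the PR; only the table names, `item_id`, and `auct_end_dt` come from the query above:

```
-- Hypothetical Spark SQL DDL: bucket both sides on the join key item_id
-- so a sort-merge join can read matching buckets without a shuffle.
CREATE TABLE lstg_item (
  item_id     BIGINT,   -- primary key, unique per row
  auct_end_dt DATE
  -- ... other columns elided
)
USING PARQUET
CLUSTERED BY (item_id) INTO 256 BUCKETS;

CREATE TABLE lstg_item_vrtn (
  item_id     BIGINT,
  auct_end_dt DATE
  -- ... other columns elided
)
USING PARQUET
CLUSTERED BY (item_id) INTO 256 BUCKETS;
```

With both tables bucketed identically on `item_id`, rows with the same `item_id` land in the same bucket number on each side, which is what lets the join above avoid re-shuffling the big table.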