[jira] [Updated] (HIVE-28798) Bucket Map Join partially using partition transforms

Shohei Okumiya (Jira) Mon, 03 Mar 2025 09:52:15 -0800


     [ 
https://issues.apache.org/jira/browse/HIVE-28798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Shohei Okumiya updated HIVE-28798:
----------------------------------
    Description: 
The current implementation requires all bucket transforms to be projected. 
Unlike Hive's native bucketing, Iceberg allows multiple bucket keys to be 
decomposed into multiple partition transforms. For example,
{code:java}
CREATE TABLE srcbucket_big(key1 int, key2 string, value string, id int)
PARTITIONED BY SPEC(bucket(4, key1), bucket(8, key2)) STORED BY ICEBERG; {code}
Currently, Bucket Map Join (BMJ) is applied only when both key1 and key2 are used as join keys.
{code:java}
SELECT a.key1, a.key2, a.id
FROM srcbucket_big a
JOIN src_small b ON a.key1 = b.key1 AND a.key2 = b.key2
ORDER BY a.id; {code}
Considering the storage layout of Apache Iceberg, the following query can also 
leverage BMJ.
{code:java}
SELECT a.key1, a.id
FROM srcbucket_big a
JOIN src_small b ON a.key1 = b.key1
ORDER BY a.id; {code}
This optimization will be helpful once HIVE-28414 extends the optimization to 
non-bucket transforms, such as daily partitioning, which are typically not 
used as join keys.

  was:
The current implementation requires all bucket transforms to be projected. 
Unlike Hive's native bucketing, Iceberg allows multiple bucket keys to be 
decomposed into multiple partition transforms. For example,
{code:java}
CREATE TABLE srcbucket_big(key1 int, key2 string, value string, id int)
PARTITIONED BY SPEC(bucket(4, key1), bucket(8, key2)) STORED BY ICEBERG; {code}
Currently, BMJ is applied when both key1 and key2 are used.
{code:java}
SELECT a.key1, a.key2, a.id
FROM srcbucket_big a
JOIN src_small b ON a.key1 = b.key1 AND a.key2 = b.key2
ORDER BY a.id; {code}
Considering the storage layout of Apache Iceberg, the following query can also 
leverage BMJ.
{code:java}
SELECT a.key1, a.id
FROM srcbucket_big a
JOIN src_small b ON a.key1 = b.key1
ORDER BY a.id; {code}
This optimization would be helpful when 
[HIVE-28414|https://issues.apache.org/jira/browse/HIVE-28414] extended the 
optimization to non-bucket transforms.


> Bucket Map Join partially using partition transforms
> ----------------------------------------------------
>
>                 Key: HIVE-28798
>                 URL: https://issues.apache.org/jira/browse/HIVE-28798
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Iceberg integration
>            Reporter: Shohei Okumiya
>            Assignee: Shohei Okumiya
>            Priority: Major
>              Labels: pull-request-available
>
> The current implementation requires all bucket transforms to be projected. 
> Unlike Hive's native bucketing, Iceberg allows multiple bucket keys to be 
> decomposed into multiple partition transforms. For example,
> {code:java}
> CREATE TABLE srcbucket_big(key1 int, key2 string, value string, id int)
> PARTITIONED BY SPEC(bucket(4, key1), bucket(8, key2)) STORED BY ICEBERG; 
> {code}
> Currently, Bucket Map Join (BMJ) is applied only when both key1 and key2 are used as join keys.
> {code:java}
> SELECT a.key1, a.key2, a.id
> FROM srcbucket_big a
> JOIN src_small b ON a.key1 = b.key1 AND a.key2 = b.key2
> ORDER BY a.id; {code}
> Considering the storage layout of Apache Iceberg, the following query can 
> also leverage BMJ.
> {code:java}
> SELECT a.key1, a.id
> FROM srcbucket_big a
> JOIN src_small b ON a.key1 = b.key1
> ORDER BY a.id; {code}
> This optimization will be helpful once HIVE-28414 extends the optimization 
> to non-bucket transforms, such as daily partitioning, which are typically 
> not used as join keys.
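
One rough way to check whether the single-key join is planned as a BMJ is to compare its plan with that of the two-key query. This is only a sketch under stated assumptions: it assumes Tez execution, that src_small (whose definition is not given in this issue) exists with a key1 column, and that hive.auto.convert.join and hive.convert.join.bucket.mapjoin.tez are the settings that govern the conversion in the build under test. The exact plan text may differ between versions.
{code:sql}
-- Assumed settings; verify the names and defaults against your Hive version.
SET hive.auto.convert.join=true;
SET hive.convert.join.bucket.mapjoin.tez=true;

-- Join on only one of the two bucket keys (key1) and inspect the plan.
EXPLAIN
SELECT a.key1, a.id
FROM srcbucket_big a
JOIN src_small b ON a.key1 = b.key1
ORDER BY a.id;
{code}
If the optimization applies, the Map Join operator in the plan would be expected to be marked as a bucket map join. Otherwise, a regular map join or a shuffle join would likely appear instead.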



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
