Pushpender Garg created HIVE-24237:
--------------------------------------

             Summary: Multi level/dimensional bucketing in Hive
                 Key: HIVE-24237
                 URL: https://issues.apache.org/jira/browse/HIVE-24237
             Project: Hive
          Issue Type: New Feature
          Components: Database/Schema
    Affects Versions: 3.1.2, 3.1.1
            Reporter: Pushpender Garg


Hive can considerably optimize certain operations, such as filters, 
aggregations, and joins, when the query uses the bucketed columns for them. 
Buckets can also be created on multiple columns, in which case the hash 
function is computed over all bucket columns combined. 

The problem is that if buckets are created on multiple columns but a query 
uses only a subset of those columns, Hive does not optimize that query. Unless 
all bucket columns are used as predicates, bucketing is not utilized. The 
solution proposed in this document addresses this problem, so that Hive can 
still optimize a query even when it uses only a subset of the bucket columns.
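
To illustrate the problem, here is a minimal Python sketch of
single-dimensional bucketing on two columns. The names combined_bucket and
NUM_BUCKETS are hypothetical, and the 31-multiplier fold over Python's
built-in hash is only a stand-in, not necessarily the hash function Hive
applies. It shows that fixing one column's value still leaves every bucket
as a candidate, so nothing can be pruned.

```python
NUM_BUCKETS = 8  # assumed bucket count, chosen for illustration


def combined_bucket(values, num_buckets=NUM_BUCKETS):
    """Fold the hashes of all bucket columns into one hash, then take the modulus.

    Stand-in for single-dimensional bucketing on multiple columns.
    """
    h = 0
    for v in values:
        h = 31 * h + hash(v)
    return (h & 0x7FFFFFFF) % num_buckets


# Rows are (a, b). Fix a = 5 and let b vary, as in "WHERE a = 5".
buckets_for_a = {combined_bucket((5, b)) for b in range(100)}

# Rows with a = 5 are spread over every bucket, so a predicate on "a"
# alone cannot rule out any bucket.
print(sorted(buckets_for_a))
```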

Instead of storing data in single-dimensional buckets, it can be stored in 
multi-dimensional buckets when multiple columns are given. If a subset of the 
bucketed columns is used as predicates in a query, then the appropriate buckets 
can be identified from the hash values of the individual columns, and only 
those buckets will be scanned. This enables optimizations even when only a 
single column or a few columns are used in the query.
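
The idea above could be sketched roughly as follows in Python. This is an
illustration under stated assumptions, not the proposed Hive implementation:
the helper names (column_slot, bucket_coords, candidate_buckets), the
per-column slot counts, and the use of Python's built-in hash are all
hypothetical. Each bucket is addressed by one coordinate per bucket column,
and an equality predicate on a column pins that coordinate while the other
coordinates stay free.

```python
from itertools import product


def column_slot(value, slots):
    """Hash a single column value into one of `slots` positions (stand-in hash)."""
    return (hash(value) & 0x7FFFFFFF) % slots


def bucket_coords(row, dims):
    """Multi-dimensional bucket of a row: one coordinate per bucket column."""
    return tuple(column_slot(v, n) for v, n in zip(row, dims))


def candidate_buckets(predicates, dims):
    """Buckets that may hold matching rows.

    predicates maps a column index to the value of an equality predicate.
    Columns without a predicate contribute all of their slots.
    """
    axes = []
    for i, n in enumerate(dims):
        if i in predicates:
            axes.append([column_slot(predicates[i], n)])
        else:
            axes.append(range(n))
    return list(product(*axes))


dims = (4, 2)  # assumed layout: 4 slots for column 0, 2 for column 1, 8 buckets

# Predicate on column 0 only: 2 of the 8 buckets need to be scanned.
only_first = candidate_buckets({0: 5}, dims)

# Predicate on column 1 only: 4 of the 8 buckets need to be scanned.
only_second = candidate_buckets({1: 7}, dims)

# Predicates on both columns: exactly one bucket.
both = candidate_buckets({0: 5, 1: 7}, dims)
```

With a predicate on a single column, the number of buckets to scan drops
from the product of all slot counts to the product of the slot counts of
the remaining columns.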



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
