[jira] [Updated] (HIVE-12337) Sorted Partitions

Maciek Kocon (JIRA) Wed, 04 Nov 2015 07:47:02 -0800

     [ 
https://issues.apache.org/jira/browse/HIVE-12337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Maciek Kocon updated HIVE-12337:
--------------------------------
    Description: To apply 
[HIVE-12336|https://issues.apache.org/jira/browse/HIVE-12336]) it is a 
prerequisite to support for creating partitioned tables where partitions would 
store their data sorted by specified columns.  (was: Logically and functionally 
bucketing and partitioning are quite similar - both provide mechanism to 
segregate and separate the table's data based on its content. Thanks to that 
significant further optimisations like [partition] PRUNING or [bucket] MAP JOIN 
are possible.
The difference seems to be imposed by design where the PARTITIONing is 
open/explicit while BUCKETing is discrete/implicit.
Partitioning seems to be very common if not a standard feature in all current 
RDBMS while BUCKETING seems to be HIVE specific only.
In a way BUCKETING could be also called by "hashing" or simply "IMPLICIT 
PARTITIONING".

Regardless of the fact that these two are recognised as two separate features 
available in Hive there should be nothing to prevent leveraging same existing 
query/join optimisations across the two.


①[Sort Merge] PARTITION Map join (no progress yet)
Enable Bucket Map Join or better, the Sort Merge Bucket Map Join equivalent 
optimisations when PARTITIONING is used exclusively or in combination with 
BUCKETING.

For JOIN conditions where partitioning criteria are used respectively:
            ⋮ 
FROM TabA JOIN TabB
   ON TabA.partCol1 = TabB.partCol2
   AND TabA.partCol2 = TabB.partCol2

the optimizer could/should choose to treat it the same way as with bucketed 
tables: ⋮ 
FROM TabC
  JOIN TabD
     ON TabC.clusteredByCol1 = TabD.clusteredByCol2
   AND TabC.clusteredByCol2 = TabD.clusteredByCol2

and use either Bucket Map Join or better, the Sort Merge Bucket Map Join. The 
latter would require capability to create sorted partitions first.

This is based on fact that same way as buckets translate to separate files, the 
partitions essentially provide the same mapping.
When data locality is known the optimizer could focus only on joining 
corresponding partitions rather than whole data sets.

②BUCKET pruning (taken care by 
[HIVE-11525|https://issues.apache.org/jira/browse/HIVE-11525])
Enable partition PRUNING equivalent optimisation for queries on BUCKETED tables

Simplest example is for queries like:
"SELECT … FROM x WHERE colA=123123"
to read only the relevant bucket file rather than all file-buckets that belong 
to a table.)

> Sorted Partitions
> -----------------
>
>                 Key: HIVE-12337
>                 URL: https://issues.apache.org/jira/browse/HIVE-12337
>             Project: Hive
>          Issue Type: Improvement
>          Components: Logical Optimizer, Physical Optimizer, SQL
>    Affects Versions: 0.13.0, 0.14.0, 0.13.1, 1.0.0, 1.1.0
>            Reporter: Maciek Kocon
>              Labels: gsoc2015
>
> To apply [HIVE-12336|https://issues.apache.org/jira/browse/HIVE-12336]) it is 
> a prerequisite to support for creating partitioned tables where partitions 
> would store their data sorted by specified columns.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (HIVE-12337) Sorted Partitions

Reply via email to