[
https://issues.apache.org/jira/browse/HIVE-12336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Maciek Kocon updated HIVE-12336:
--------------------------------
Description:
Logically and functionally bucketing and partitioning are quite similar - both
provide mechanism to segregate and separate the table's data based on its
content. Thanks to that significant further optimisations like [partition]
PRUNING or [bucket] MAP JOIN are possible.
The difference seems to be imposed by design where the PARTITIONing is
open/explicit while BUCKETing is discrete/implicit.
Partitioning seems to be very common if not a standard feature in all current
RDBMS while BUCKETING seems to be HIVE specific only.
In a way BUCKETING could be also called by "hashing" or simply "IMPLICIT
PARTITIONING".
Regardless of the fact that these two are recognised as two separate features
available in Hive there should be nothing to prevent leveraging same existing
query/join optimisations across the two.
PARTITION MAPJOIN
Use the same type of optimization as in SORT MERGE BUCKETED MAP JOIN for
partitioned tables.
The sort-merge join optimization could be performed when PARTITIONED tables
being joined are sorted and partitioned on the join columns.
The corresponding partitions are joined with each other at the mapper. If both
A and B have partitions set on their columns KEY, the following join
SELECT /*+ MAPJOIN(b) */ a.key, a.value
FROM A a JOIN B b ON a.key = b.key
can be done on the mapper only. The mapper for the partition key='201512' for A
will traverse the corresponding partition for B. Traversing is possible if the
corresponding partitions are sorted on the same columns.
was:
Logically and functionally bucketing and partitioning are quite similar - both
provide mechanism to segregate and separate the table's data based on its
content. Thanks to that significant further optimisations like [partition]
PRUNING or [bucket] MAP JOIN are possible.
The difference seems to be imposed by design where the PARTITIONing is
open/explicit while BUCKETing is discrete/implicit.
Partitioning seems to be very common if not a standard feature in all current
RDBMS while BUCKETING seems to be HIVE specific only.
In a way BUCKETING could be also called by "hashing" or simply "IMPLICIT
PARTITIONING".
Regardless of the fact that these two are recognised as two separate features
available in Hive there should be nothing to prevent leveraging same existing
query/join optimisations across the two.
①[Sort Merge] PARTITION Map join (no progress yet)
Enable Bucket Map Join or better, the Sort Merge Bucket Map Join equivalent
optimisations when PARTITIONING is used exclusively or in combination with
BUCKETING.
For JOIN conditions where partitioning criteria are used respectively:
⋮
FROM TabA JOIN TabB
ON TabA.partCol1 = TabB.partCol2
AND TabA.partCol2 = TabB.partCol2
the optimizer could/should choose to treat it the same way as with bucketed
tables: ⋮
FROM TabC
JOIN TabD
ON TabC.clusteredByCol1 = TabD.clusteredByCol2
AND TabC.clusteredByCol2 = TabD.clusteredByCol2
and use either Bucket Map Join or better, the Sort Merge Bucket Map Join. The
latter would require capability to create sorted partitions first.
This is based on fact that same way as buckets translate to separate files, the
partitions essentially provide the same mapping.
When data locality is known the optimizer could focus only on joining
corresponding partitions rather than whole data sets.
②BUCKET pruning (taken care by
[HIVE-11525|https://issues.apache.org/jira/browse/HIVE-11525])
Enable partition PRUNING equivalent optimisation for queries on BUCKETED tables
Simplest example is for queries like:
"SELECT … FROM x WHERE colA=123123"
to read only the relevant bucket file rather than all file-buckets that belong
to a table.
> Sort Merge Partition Map Join
> -----------------------------
>
> Key: HIVE-12336
> URL: https://issues.apache.org/jira/browse/HIVE-12336
> Project: Hive
> Issue Type: Improvement
> Components: Logical Optimizer, Physical Optimizer, SQL
> Affects Versions: 0.13.0, 0.14.0, 0.13.1, 1.0.0, 1.1.0
> Reporter: Maciek Kocon
> Labels: gsoc2015
>
> Logically and functionally bucketing and partitioning are quite similar -
> both provide mechanism to segregate and separate the table's data based on
> its content. Thanks to that significant further optimisations like
> [partition] PRUNING or [bucket] MAP JOIN are possible.
> The difference seems to be imposed by design where the PARTITIONing is
> open/explicit while BUCKETing is discrete/implicit.
> Partitioning seems to be very common if not a standard feature in all current
> RDBMS while BUCKETING seems to be HIVE specific only.
> In a way BUCKETING could be also called by "hashing" or simply "IMPLICIT
> PARTITIONING".
> Regardless of the fact that these two are recognised as two separate features
> available in Hive there should be nothing to prevent leveraging same existing
> query/join optimisations across the two.
> PARTITION MAPJOIN
> Use the same type of optimization as in SORT MERGE BUCKETED MAP JOIN for
> partitioned tables.
> The sort-merge join optimization could be performed when PARTITIONED tables
> being joined are sorted and partitioned on the join columns.
> The corresponding partitions are joined with each other at the mapper. If
> both A and B have partitions set on their columns KEY, the following join
> SELECT /*+ MAPJOIN(b) */ a.key, a.value
> FROM A a JOIN B b ON a.key = b.key
> can be done on the mapper only. The mapper for the partition key='201512' for
> A will traverse the corresponding partition for B. Traversing is possible if
> the corresponding partitions are sorted on the same columns.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)