[ https://issues.apache.org/jira/browse/HIVE-12334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Maciek Kocon updated HIVE-12334: -------------------------------- Description: Logically and functionally bucketing and partitioning are quite similar - both provide mechanism to segregate and separate the table's data based on its content. Thanks to that significant further optimisations like [partition] PRUNING or [bucket] MAP JOIN are possible. The difference seems to be imposed by design where the PARTITIONing is open/explicit while BUCKETing is discrete/implicit. Partitioning seems to be very common if not a standard feature in all current RDBMS while BUCKETING seems to be HIVE specific only. In a way BUCKETING could be also called by "hashing" or simply "IMPLICIT PARTITIONING". Regardless of the fact that these two are recognised as two separate features available in Hive there should be nothing to prevent leveraging same existing query/join optimisations across the two. PARTITION MAPJOIN Use the same type of optimization as in BUCKETED MAP JOIN for PARTITIONED tables. The partition map join could be performed if the tables being joined are partitioned on the join columns. If table A has set partitioning on KEY column and table B is partitioned on KEY column, the following join SELECT /*+ MAPJOIN(b) */ a.key, a.value FROM a JOIN b ON a.key = b.key can be done on the mapper only. Instead of fetching B completely for each mapper of A, only the required partitions are fetched. For the query above, the mapper processing partition key='20151208' for A will only fetch partition for key='20151208' of B. was: Logically and functionally bucketing and partitioning are quite similar - both provide mechanism to segregate and separate the table's data based on its content. Thanks to that significant further optimisations like [partition] PRUNING or [bucket] MAP JOIN are possible. The difference seems to be imposed by design where the PARTITIONing is open/explicit while BUCKETing is discrete/implicit. Partitioning seems to be very common if not a standard feature in all current RDBMS while BUCKETING seems to be HIVE specific only. In a way BUCKETING could be also called by "hashing" or simply "IMPLICIT PARTITIONING". Regardless of the fact that these two are recognised as two separate features available in Hive there should be nothing to prevent leveraging same existing query/join optimisations across the two. PARTITION MAPJOIN Use the same type of optimization as in BUCKETED MAP JOIN when PARTITIONED tables being joined are partitioned on the join columns: If table A has set partitioning on KEY column and table B is partitioned on KEY column, the following join SELECT /*+ MAPJOIN(b) */ a.key, a.value FROM a JOIN b ON a.key = b.key can be done on the mapper only. Instead of fetching B completely for each mapper of A, only the required partitions are fetched. For the query above, the mapper processing partition key='20151208' for A will only fetch partition for key='20151208' of B. > Partition Map Join > ------------------ > > Key: HIVE-12334 > URL: https://issues.apache.org/jira/browse/HIVE-12334 > Project: Hive > Issue Type: Improvement > Components: Logical Optimizer, Physical Optimizer, SQL > Affects Versions: 0.13.0, 0.14.0, 0.13.1, 1.0.0, 1.1.0 > Reporter: Maciek Kocon > Labels: gsoc2015 > > Logically and functionally bucketing and partitioning are quite similar - > both provide mechanism to segregate and separate the table's data based on > its content. Thanks to that significant further optimisations like > [partition] PRUNING or [bucket] MAP JOIN are possible. > The difference seems to be imposed by design where the PARTITIONing is > open/explicit while BUCKETing is discrete/implicit. > Partitioning seems to be very common if not a standard feature in all current > RDBMS while BUCKETING seems to be HIVE specific only. > In a way BUCKETING could be also called by "hashing" or simply "IMPLICIT > PARTITIONING". > Regardless of the fact that these two are recognised as two separate features > available in Hive there should be nothing to prevent leveraging same existing > query/join optimisations across the two. > PARTITION MAPJOIN > Use the same type of optimization as in BUCKETED MAP JOIN for PARTITIONED > tables. > The partition map join could be performed if the tables being joined are > partitioned on the join columns. > If table A has set partitioning on KEY column and table B is partitioned on > KEY column, the following join > SELECT /*+ MAPJOIN(b) */ a.key, a.value > FROM a JOIN b ON a.key = b.key > can be done on the mapper only. Instead of fetching B completely for each > mapper of A, only the required partitions are fetched. For the query above, > the mapper processing partition key='20151208' for A will only fetch > partition for key='20151208' of B. -- This message was sent by Atlassian JIRA (v6.3.4#6332)