Michael Ho created IMPALA-9083:
----------------------------------
Summary: Explore co-partitioning join strategies
Key: IMPALA-9083
URL: https://issues.apache.org/jira/browse/IMPALA-9083
Project: IMPALA
Issue Type: Improvement
Components: Frontend
Affects Versions: Impala 3.4.0
Reporter: Michael Ho
The idea of co-partitioning join is not new and it has been implemented in
other systems (see
[link|https://docs.oracle.com/en/database/oracle/oracle-database/12.2/vldbg/partition-wise-joins.html]
and
[link|http://amithora.com/understanding-co-partitions-and-co-grouping-in-spark/]).
With HDFS, we may not have as much control over the block locations so it's
not as easy to make assumption about the co-location of partitions of multiple
tables.
With remote reads (e.g. S3), we don't have this constraint anymore as all data
are remote. So, if we are joining on the partition key of two tables, we can
create a scan range schedule to co-locate partitions with the same partition
values on the same executor and do the join locally to avoid the subsequent
shuffling step.
The potential downside with this idea is that if there are skews in the data
distribution, we may overwhelm a few of the nodes in both the scans and join
operators. Previously, if we have skew in data distribution, we may still
overwhelm the join operator but the scans can be better parallelized. cc'ing
[~drorke]
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]