[ https://issues.apache.org/jira/browse/IMPALA-4857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Manish Maheshwari updated IMPALA-4857:
--------------------------------------
Labels: 2023Q1 resource-management (was: resource-management)
> Handle large # of duplicate keys on build side of a spilling hash join
> ----------------------------------------------------------------------
>
> Key: IMPALA-4857
> URL: https://issues.apache.org/jira/browse/IMPALA-4857
> Project: IMPALA
> Issue Type: Improvement
> Components: Backend
> Affects Versions: Impala 2.9.0
> Reporter: Tim Armstrong
> Priority: Minor
> Labels: 2023Q1, resource-management
>
> Currently the hash join implementation relies on recursively repartitioning
> the build side until each partition can fit entirely in memory. This
> works well in many cases, but can fail if a large number of rows share the
> same key and together do not fit in the available memory.
> This results in an error like: "Cannot perform hash join at node with id 6.
> Repartitioning did not reduce the size of a spilled partition. Repartitioning
> level 6. Number of rows 275352"
> A special case of this is a Null-aware anti join with many NULLs on the build
> side.
> This error often occurs because of a suboptimal query or plan that has a lot
> of duplicate values on one side of the join. Changing the join operator to
> spill in these cases would, in many of them, let the query run to completion,
> but very slowly (since it needs to do a quadratic pairwise comparison of the
> matching rows on both sides of the join).
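
The failure mode described above can be reproduced with a small sketch. This is not Impala's implementation: the names `FANOUT`, `MAX_LEVEL`, `level_hash`, `partition_build_side` and `RepartitionError` are hypothetical, and memory is modelled simply as a row count. It only illustrates why rehashing cannot split rows that all share one key.

```python
import hashlib

FANOUT = 4      # hypothetical number of partitions created per level
MAX_LEVEL = 6   # hypothetical cap on repartitioning depth


class RepartitionError(Exception):
    """Raised when repartitioning cannot make a partition fit in memory."""


def level_hash(key, level):
    # Mix the level into the hash so each level splits rows differently.
    digest = hashlib.sha256(f"{level}:{key!r}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def partition_build_side(keys, memory_rows, level=0):
    """Split build-side keys into partitions of at most memory_rows rows."""
    if len(keys) <= memory_rows:
        return [keys]
    if level >= MAX_LEVEL:
        raise RepartitionError(f"reached level {level} with {len(keys)} rows")

    buckets = [[] for _ in range(FANOUT)]
    for key in keys:
        buckets[level_hash(key, level) % FANOUT].append(key)

    partitions = []
    for bucket in buckets:
        if len(bucket) == len(keys):
            # Every row landed in one bucket: the split made no progress.
            raise RepartitionError(
                f"repartitioning did not reduce partition size "
                f"(level {level}, rows {len(bucket)})"
            )
        if bucket:
            partitions.extend(
                partition_build_side(bucket, memory_rows, level + 1)
            )
    return partitions
```

With distinct keys, for example `partition_build_side(list(range(1000)), 100)`, the recursion ends once every partition holds at most 100 rows. With `[7] * 1000` every row hashes to the same bucket at every level, so the first split makes no progress and `RepartitionError` is raised, whatever hash function is used.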