[ 
https://issues.apache.org/jira/browse/IMPALA-4857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manish Maheshwari updated IMPALA-4857:
--------------------------------------
    Labels: 2023Q1 resource-management  (was: resource-management)

> Handle large # of duplicate keys on build side of a spilling hash join
> ----------------------------------------------------------------------
>
>                 Key: IMPALA-4857
>                 URL: https://issues.apache.org/jira/browse/IMPALA-4857
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>    Affects Versions: Impala 2.9.0
>            Reporter: Tim Armstrong
>            Priority: Minor
>              Labels: 2023Q1, resource-management
>
> Currently the hash join implementation relies on recursively repartitioning 
> the build side until a single partition can fit entirely in memory. This 
> works well in many cases, but can fail if there are a large number of rows 
> with duplicate keys that does not fit in the available memory.
> This results in an error like: "Cannot perform hash join at node with id 6. 
> Repartitioning did not reduce the size of a spilled partition. Repartitioning 
> level 6. Number of rows 275352"
> A special case of this is a Null-aware anti join with many NULLs on the build 
> side.
> This error often occurs because of a suboptimal query or plan that has a lot 
> of duplicate values on one side of the join. Changing the join operator to 
> spill in many of these cases would result in the query running to completion, 
> but very slowly (since it needs to do a quadratic pairwise comparison of both 
> sides of the join).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to