[
https://issues.apache.org/jira/browse/HIVE-28428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Denys Kuzmenko resolved HIVE-28428.
-----------------------------------
Fix Version/s: 4.1.0
Resolution: Fixed
> Map hash aggregation performance degradation
> ---------------------------------------------
>
> Key: HIVE-28428
> URL: https://issues.apache.org/jira/browse/HIVE-28428
> Project: Hive
> Issue Type: Improvement
> Reporter: Ryu Kobayashi
> Assignee: Ryu Kobayashi
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.1.0
>
> Attachments: 2024-08-02 14.35.46.png,
> image-2024-08-02-14-37-01-824.png, image-2024-08-02-14-38-45-459.png
>
>
> The following ticket was fixed to enable map hash aggregation, but performance
> is worse with it enabled than when it is disabled.
> https://issues.apache.org/jira/browse/HIVE-23356
> I found a few reasons for this. If there are a large number of keys, the
> following log lines are written in large volume, which hurts performance.
> This can also cause an OOM.
> {code:java}
> 2024-08-02 05:21:53,675 [INFO] [TezChild] |exec.GroupByOperator|: Hash Tbl flush: #hash table = 171000
> 2024-08-02 05:21:53,713 [INFO] [TezChild] |exec.GroupByOperator|: Hash Table flushed: new size = 153900
> {code}
> By fixing this, we can improve performance as follows.
> Before:
> !image-2024-08-02-14-37-01-824.png!
> After:
> !2024-08-02 14.35.46.png!
> Also, the flush size is currently fixed, but performance can be improved by
> varying it depending on the data:
> !image-2024-08-02-14-38-45-459.png!
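
To make the flush-size point concrete: in the log excerpt above the table shrinks from 171000 to 153900 entries, i.e. by 10%. The sketch below is a hypothetical illustration only, not Hive's GroupByOperator code. The class name, the `flush` and `countFlushes` helpers, and the `fraction` parameter are all assumptions made for this example. It shows how the fraction evicted per flush changes how often a flush (and, if each flush logs, a pair of log lines) occurs when the number of distinct keys is large.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of a partial hash-table flush; not Hive's actual implementation. */
public class PartialFlushSketch {

    /**
     * Removes about `fraction` of the entries (oldest first) and returns how many
     * were removed. A real operator would forward the evicted rows downstream.
     */
    static <K, V> int flush(Map<K, V> table, double fraction) {
        int toRemove = Math.max(1, (int) Math.round(table.size() * fraction));
        int removed = 0;
        Iterator<Map.Entry<K, V>> it = table.entrySet().iterator();
        while (it.hasNext() && removed < toRemove) {
            it.next();
            it.remove();
            removed++;
        }
        return removed;
    }

    /** Counts the flushes triggered while aggregating `keys` distinct keys under a size cap. */
    static int countFlushes(int keys, int maxEntries, double fraction) {
        Map<Integer, Long> table = new LinkedHashMap<>();
        int flushes = 0;
        for (int k = 0; k < keys; k++) {
            table.merge(k, 1L, Long::sum);      // aggregate: count per key
            if (table.size() >= maxEntries) {   // assumed trigger: entry-count cap
                flush(table, fraction);
                flushes++;
            }
        }
        return flushes;
    }

    public static void main(String[] args) {
        Map<Integer, Long> table = new LinkedHashMap<>();
        for (int k = 0; k < 171_000; k++) {
            table.put(k, 1L);
        }
        flush(table, 0.10);
        System.out.println("size after 10% flush: " + table.size()); // 153900

        System.out.println("flushes at 10%: " + countFlushes(1_000_000, 171_000, 0.10));
        System.out.println("flushes at 50%: " + countFlushes(1_000_000, 171_000, 0.50));
    }
}
```

With mostly distinct keys, a small fixed fraction means the table refills almost immediately and flushes repeat often. A larger fraction flushes less often but gives up more partially aggregated state each time, so the better setting likely depends on the key distribution.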