[
https://issues.apache.org/jira/browse/HIVE-28548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Shohei Okumiya updated HIVE-28548:
----------------------------------
Status: Patch Available (was: Open)
A patch is here
https://github.com/apache/hive/pull/5478
> Subdivide memory size allocated to parallel operators
> -----------------------------------------------------
>
> Key: HIVE-28548
> URL: https://issues.apache.org/jira/browse/HIVE-28548
> Project: Hive
> Issue Type: Sub-task
> Components: Physical Optimizer
> Affects Versions: 4.0.0
> Reporter: Shohei Okumiya
> Assignee: Shohei Okumiya
> Priority: Major
>
> SharedWorkOptimizer can merge operators which set up hash tables. Currently,
> each operator assumes it can use \{container memory} *
> \{hive.map.aggr.hash.percentmemory}. This assumption can cause OOM since
> merged operators run in parallel. It is unrealistic to avoid the issue by
> tuning `hive.map.aggr.hash.percentmemory` in this case because users can't
> know how many operators are likely merged ahead of time.
>
> Tez has some mechanism to allocate container's memory to multiple edges. Hive
> can also have something similar.
> [https://github.com/apache/tez/blob/rel/release-0.10.4/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/resources/WeightedScalingMemoryDistributor.java]
>
> *How to reproduce*
> Dummy data. The 10% of constants prevent optimization through
> `hive.map.aggr.hash.min.reduction`.
> {code:java}
> INSERT INTO test_data SELECT IF(RAND() < 0.1,
> '00000000-00000000-00000000-00000000', UUID()) AS key, UUID() AS value FROM
> {large table} {code}
> Query. You should add more joins if your container is large.
> {code:java}
> SELECT *
> FROM (SELECT key, MAX(value) FROM test_data WHERE key != '1' GROUP BY key) t1
> LEFT OUTER JOIN (SELECT key, MAX(value) FROM test_data WHERE key != '2' GROUP
> BY key) t2 ON t1.key = t2.key
> LEFT OUTER JOIN (SELECT key, MAX(value) FROM test_data WHERE key != '3' GROUP
> BY key) t3 ON t1.key = t3.key
> LEFT OUTER JOIN (SELECT key, MAX(value) FROM test_data WHERE key != '4' GROUP
> BY key) t4 ON t1.key = t4.key
> LEFT OUTER JOIN (SELECT key, MAX(value) FROM test_data WHERE key != '5' GROUP
> BY key) t5 ON t1.key = t5.key {code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)