Michael Smith created HIVE-26346:
------------------------------------

             Summary: Default Tez memory limits occasionally result in killing container
                 Key: HIVE-26346
                 URL: https://issues.apache.org/jira/browse/HIVE-26346
             Project: Hive
          Issue Type: Improvement
          Components: Tez
    Affects Versions: 3.1.3
            Reporter: Michael Smith


When inserting data into Hive, the insert occasionally fails with messages like
{quote}
FAILED: Execution Error, return code 2 from 
org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, vertexName=Map 1, 
vertexId=vertex_1605060173780_0039_2_00, diagnostics=[Task failed, 
taskId=task_1605060173780_0039_2_00_000000, diagnostics=[TaskAttempt 0 failed, 
info=[Container container_1605060173780_0039_01_000002 finished with 
diagnostics set to [Container failed, exitCode=-104. [2020-11-11 
02:35:11.768]Container 
[pid=16810,containerID=container_1605060173780_0039_01_000002] is running 
7729152B beyond the 'PHYSICAL' memory limit. Current usage: 1.0 GB of 1 GB 
physical memory used; 2.5 GB of 2.1 GB virtual memory used. Killing container.
{quote}

Specifically, the TezChild container is using a small amount of physical memory beyond its limit (about 7.4 MB in the log above), so YARN kills the container.

Identifying how to resolve this is somewhat fraught:
- Our docs offer no clear troubleshooting advice for this error. Googling led to several forums with a mix of good and awful advice; 
https://community.cloudera.com/t5/Community-Articles/Demystify-Apache-Tez-Memory-Tuning-Step-by-Step/ta-p/245279 is probably the best of them.
- The issue itself comes down to Tez allocating 80% of the memory limit to Java heap (Xmx), which, depending on other memory usage (stack memory, JIT, other JVM overhead), can leave too little headroom. By comparison, when running in a cgroup, Java defaults Xmx to 25% of the memory limit. (See the worked numbers after this list.)
- Identifying the right parameters to tune, and verifying they had been set correctly, was a bit challenging. We ended up experimenting with {{tez.container.max.java.heap.fraction}}, {{hive.tez.container.size}}, and {{yarn.scheduler.minimum-allocation-mb}}. I then verified those took effect by monitoring process arguments (with {{htop}}) for changes in Xmx. We definitely had some missteps figuring out when a property is prefixed {{hive.tez.container}} vs {{tez.container}}.
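
To make the heap fraction arithmetic concrete, here is a rough sketch using the 1 GB container from the log above; the split of the non-heap remainder is an assumption that will vary by JVM and workload:
{code}
Default fraction (0.8) on a 1024 MB container:
  Xmx = 0.8 * 1024 MB = ~819 MB heap, leaving ~205 MB for stacks, JIT, GC, and other native use
Lowered fraction (0.75), same container:
  Xmx = 0.75 * 1024 MB = ~768 MB heap, ~256 MB native headroom
Default fraction (0.8) on a 2048 MB container:
  Xmx = 0.8 * 2048 MB = ~1638 MB heap, ~410 MB native headroom
{code}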

In the end, any one of the following seems to have worked for us:
* {{SET yarn.scheduler.minimum-allocation-mb=2048}}
* {{SET tez.container.max.java.heap.fraction=0.75}}
* {{SET hive.tez.container.size=2048}}
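
To confirm which value is live in a session (and to catch the {{hive.tez.container}} vs {{tez.container}} prefix mixups above), Hive echoes a property back when {{SET}} is given no value. A minimal sketch; the comments reflect our understanding of each knob, not authoritative definitions:
{code}
-- SET <property>; prints property=value for the current session
SET hive.tez.container.size;               -- Hive-side container size request, in MB
SET tez.container.max.java.heap.fraction;  -- fraction of the container handed to Xmx
SET yarn.scheduler.minimum-allocation-mb;  -- YARN's rounding floor for container allocations
{code}
The definitive check is still the running process: the {{-Xmx}} on the TezChild command line (visible in {{htop}}) should move once a setting actually takes effect.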



