Michael Smith created HIVE-26346:
------------------------------------

             Summary: Default Tez memory limits occasionally result in killing container
                 Key: HIVE-26346
                 URL: https://issues.apache.org/jira/browse/HIVE-26346
             Project: Hive
          Issue Type: Improvement
          Components: Tez
    Affects Versions: 3.1.3
            Reporter: Michael Smith
When inserting data into Hive, the insert occasionally fails with messages like:

{quote}
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, vertexName=Map 1, vertexId=vertex_1605060173780_0039_2_00, diagnostics=[Task failed, taskId=task_1605060173780_0039_2_00_000000, diagnostics=[TaskAttempt 0 failed, info=[Container container_1605060173780_0039_01_000002 finished with diagnostics set to [Container failed, exitCode=-104. [2020-11-11 02:35:11.768]Container [pid=16810,containerID=container_1605060173780_0039_01_000002] is running 7729152B beyond the 'PHYSICAL' memory limit. Current usage: 1.0 GB of 1 GB physical memory used; 2.5 GB of 2.1 GB virtual memory used. Killing container.
{quote}

Specifically, the TezChild container exceeds its physical memory limit by a small amount, so YARN kills the container. Identifying how to resolve this is somewhat fraught:
- There's no clear troubleshooting advice around this error in our docs. Googling led to several forums with some good and some awful advice; https://community.cloudera.com/t5/Community-Articles/Demystify-Apache-Tez-Memory-Tuning-Step-by-Step/ta-p/245279 is probably the best one.
- The issue itself comes down to Tez allocating 80% of the container memory limit to the Java heap (Xmx), which, depending on other memory usage (thread stacks, JIT, other JVM overhead), can leave too little headroom for the rest of the process (see the sketch after this list). By comparison, when running in a cgroup, Java defaults Xmx to 25% of the memory limit.
- Identifying the right parameters to tune, and verifying they had been set correctly, was a bit challenging. We ended up playing with {{tez.container.max.java.heap.fraction}}, {{hive.tez.container.size}}, and {{yarn.scheduler.minimum-allocation-mb}}, then verified they took effect by watching the TezChild process arguments (with {{htop}}) for the expected Xmx change. We definitely had some missteps figuring out when the prefix is {{hive.tez.container}} versus {{tez.container}}.

In the end, any of the following seemed to work for us:
* {{SET yarn.scheduler.minimum-allocation-mb=2048}}
* {{SET tez.container.max.java.heap.fraction=0.75}}
* {{SET hive.tez.container.size=2048}}
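To make the heap arithmetic concrete, here is a minimal sketch of the session-level overrides we ended up with; the numbers are illustrative only (they assume the 1 GB container from the log above and Tez's default heap fraction of 0.8) and need tuning per cluster and query.

{code:sql}
-- Sketch only; exact values depend on the cluster and the query.
-- Default arithmetic, matching the log above: a 1024 MB container with
-- tez.container.max.java.heap.fraction=0.8 gives -Xmx of roughly 819 MB,
-- leaving only ~205 MB for thread stacks, metaspace, the JIT code cache
-- and direct buffers before the YARN physical-memory check kills the
-- container (exitCode=-104).

-- Request a 2 GB container instead:
SET hive.tez.container.size=2048;

-- And/or keep more non-heap headroom: -Xmx becomes ~0.75 * 2048 = 1536 MB,
-- leaving ~512 MB inside the 2 GB physical limit:
SET tez.container.max.java.heap.fraction=0.75;
{code}

After changing either value, the new limit can be confirmed by checking the TezChild process arguments (e.g. in {{htop}}) for the expected {{-Xmx}} value.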