Ankur Raj created TEZ-4507:
------------------------------

             Summary: Task failure due to memory issues in 0.10+
                 Key: TEZ-4507
                 URL: https://issues.apache.org/jira/browse/TEZ-4507
             Project: Apache Tez
          Issue Type: Bug
    Affects Versions: 0.10.2, 0.10.1, 0.10.0
            Reporter: Ankur Raj
This issue is seen on AWS EMR-6.9.0 and above clusters, which ship the tez-0.10.2 branch, when running insert overwrite queries on large amounts of data; the user in this case was running insert overwrite on 9 TB of data. *Reduce tasks fail to read shuffle data, and the Hive query ultimately fails with a {{corrupted double-linked list}} error.*

Logs:
{code:java}
Map 1: 1392(+495)/1922 Reducer 2: 0(+0,-67)/78414
Map 1: 1392(+490)/1922 Reducer 2: 0(+0,-67)/78414
Map 1: 1402(+0)/1922 Reducer 2: 0(+0,-67)/78414
Status: Failed
Vertex failed, vertexName=Reducer 2, vertexId=vertex_1686949228586_0003_1_01, diagnostics=[Task failed, taskId=task_1686949228586_0003_1_01_002253, diagnostics=[TaskAttempt 0 failed, info=[Container container_1686949228586_0003_01_000050 finished with diagnostics set to [Container failed, exitCode=134.
[2023-06-16 21:54:04.463]Exception from container-launch.
Container id: container_1686949228586_0003_01_000050
Exit code: 134
[2023-06-16 21:54:04.468]Container exited with a non-zero exit code 134. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
/bin/bash: line 1: 28148 Aborted /usr/lib/jvm/java-1.8.0/bin/java -Xmx14745m -server -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1686949228586_0003/container_1686949228586_0003_01_000050 -Dtez.root.logger=INFO,CLA -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1686949228586_0003/container_1686949228586_0003_01_000050 -Djava.io.tmpdir=/mnt/yarn/usercache/hadoop/appcache/application_1686949228586_0003/container_1686949228586_0003_01_000050/tmp org.apache.tez.runtime.task.TezChild ip-10-1-14-249.ec2.internal 35871 container_1686949228586_0003_01_000050 application_1686949228586_0003 1 > /var/log/hadoop-yarn/containers/application_1686949228586_0003/container_1686949228586_0003_01_000050/stdout 2> /var/log/hadoop-yarn/containers/application_1686949228586_0003/container_1686949228586_0003_01_000050/stderr
Last 4096 bytes of stderr :
2023-06-16 21:03:28 Starting to run new task attempt: attempt_1686949228586_0003_1_00_000062_0
2023-06-16 21:29:21 Completed running task attempt: attempt_1686949228586_0003_1_00_000062_0
2023-06-16 21:29:22 Starting to run new task attempt: attempt_1686949228586_0003_1_00_001083_0
2023-06-16 21:43:06 Completed running task attempt: attempt_1686949228586_0003_1_00_001083_0
2023-06-16 21:43:07 Starting to run new task attempt: attempt_1686949228586_0003_1_00_001424_0
2023-06-16 21:53:57 Completed running task attempt: attempt_1686949228586_0003_1_00_001424_0
2023-06-16 21:53:58 Starting to run new task attempt: attempt_1686949228586_0003_1_01_002253_0
[2023-06-16 21:54:04.469]Container exited with a non-zero exit code 134. Error file: prelaunch.err.
{code}
(The diagnostics repeat the same prelaunch.err and stderr tail a second time; the duplicate is omitted here.)

A private patch created by reverting the changes from TEZ-4135 and related commits fixed this issue. The problem does not reproduce with around 100 GB of data. We are trying to reproduce it in our test environment and have been unsuccessful so far, but the issue consistently occurs in the customer's environment. The user was unable to share more logs.
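For context (this is general Unix convention, not something taken from the Tez logs themselves): a container exit code above 128 means the process was killed by a signal, namely signal number {{exit_code - 128}}. So exit code 134 corresponds to signal 6, SIGABRT, which is what glibc raises via {{abort()}} when it detects native-heap corruption such as a {{corrupted double-linked list}}. A minimal sketch:

```python
import signal

# YARN reported "Container failed, exitCode=134". By shell convention,
# exit codes above 128 mean the process died from signal (code - 128).
exit_code = 134
sig = signal.Signals(exit_code - 128)
print(sig.name)  # SIGABRT
```

A SIGABRT paired with the {{corrupted double-linked list}} message points at glibc malloc detecting corruption of its native heap (the JVM's off-heap/native memory), rather than ordinary Java heap exhaustion, which would normally surface as an {{OutOfMemoryError}} instead of an abort.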
I will add more logs once I am able to reproduce it in my environment. Opening this ticket now to gather initial ideas from the community.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)