Ankur Raj created TEZ-4507:
------------------------------

             Summary: Task failure due to memory issues in 0.10+
                 Key: TEZ-4507
                 URL: https://issues.apache.org/jira/browse/TEZ-4507
             Project: Apache Tez
          Issue Type: Bug
    Affects Versions: 0.10.2, 0.10.1, 0.10.0
            Reporter: Ankur Raj


This issue is seen on AWS EMR-6.9.0 and later clusters, which ship the 
tez-0.10.2 branch, when running insert overwrite queries on very large amounts 
of data. The user in this case was running insert overwrite on 9 TB of data. 

*Reduce tasks fail to read shuffle data and the Hive query ultimately fails 
with a {{corrupted double-linked list}} error.*

Logs:

{{Map 1: 1392(+495)/1922 Reducer 2: 0(+0,-67)/78414
Map 1: 1392(+490)/1922 Reducer 2: 0(+0,-67)/78414
Map 1: 1402(+0)/1922 Reducer 2: 0(+0,-67)/78414
Status: Failed
Vertex failed, vertexName=Reducer 2, vertexId=vertex_1686949228586_0003_1_01, diagnostics=[Task failed, taskId=task_1686949228586_0003_1_01_002253, diagnostics=[TaskAttempt 0 failed, info=[Container container_1686949228586_0003_01_000050 finished with diagnostics set to [Container failed, exitCode=134.
[2023-06-16 21:54:04.463]Exception from container-launch.
Container id: container_1686949228586_0003_01_000050
Exit code: 134
[2023-06-16 21:54:04.468]Container exited with a non-zero exit code 134. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
/bin/bash: line 1: 28148 Aborted /usr/lib/jvm/java-1.8.0/bin/java -Xmx14745m -server -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1686949228586_0003/container_1686949228586_0003_01_000050 -Dtez.root.logger=INFO,CLA -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1686949228586_0003/container_1686949228586_0003_01_000050 -Djava.io.tmpdir=/mnt/yarn/usercache/hadoop/appcache/application_1686949228586_0003/container_1686949228586_0003_01_000050/tmp org.apache.tez.runtime.task.TezChild ip-10-1-14-249.ec2.internal 35871 container_1686949228586_0003_01_000050 application_1686949228586_0003 1 > /var/log/hadoop-yarn/containers/application_1686949228586_0003/container_1686949228586_0003_01_000050/stdout 2> /var/log/hadoop-yarn/containers/application_1686949228586_0003/container_1686949228586_0003_01_000050/stderr
Last 4096 bytes of stderr :
2023-06-16 21:03:28 Starting to run new task attempt: attempt_1686949228586_0003_1_00_000062_0
2023-06-16 21:29:21 Completed running task attempt: attempt_1686949228586_0003_1_00_000062_0
2023-06-16 21:29:22 Starting to run new task attempt: attempt_1686949228586_0003_1_00_001083_0
2023-06-16 21:43:06 Completed running task attempt: attempt_1686949228586_0003_1_00_001083_0
2023-06-16 21:43:07 Starting to run new task attempt: attempt_1686949228586_0003_1_00_001424_0
2023-06-16 21:53:57 Completed running task attempt: attempt_1686949228586_0003_1_00_001424_0
2023-06-16 21:53:58 Starting to run new task attempt: attempt_1686949228586_0003_1_01_002253_0
[2023-06-16 21:54:04.469]Container exited with a non-zero exit code 134. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
/bin/bash: line 1: 28148 Aborted /usr/lib/jvm/java-1.8.0/bin/java -Xmx14745m -server -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1686949228586_0003/container_1686949228586_0003_01_000050 -Dtez.root.logger=INFO,CLA -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1686949228586_0003/container_1686949228586_0003_01_000050 -Djava.io.tmpdir=/mnt/yarn/usercache/hadoop/appcache/application_1686949228586_0003/container_1686949228586_0003_01_000050/tmp org.apache.tez.runtime.task.TezChild ip-10-1-14-249.ec2.internal 35871 container_1686949228586_0003_01_000050 application_1686949228586_0003 1 > /var/log/hadoop-yarn/containers/application_1686949228586_0003/container_1686949228586_0003_01_000050/stdout 2> /var/log/hadoop-yarn/containers/application_1686949228586_0003/container_1686949228586_0003_01_000050/stderr
Last 4096 bytes of stderr :
2023-06-16 21:03:28 Starting to run new task attempt: attempt_1686949228586_0003_1_00_000062_0
2023-06-16 21:29:21 Completed running task attempt: attempt_1686949228586_0003_1_00_000062_0
2023-06-16 21:29:22 Starting to run new task attempt: attempt_1686949228586_0003_1_00_001083_0
2023-06-16 21:43:06 Completed running task attempt: attempt_1686949228586_0003_1_00_001083_0}}
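
For reading the log above: exit code 134 follows the shell convention of 128 plus 
the signal number, so it corresponds to signal 6 (SIGABRT), which matches the 
{{Aborted}} message from bash. The {{corrupted double-linked list}} text is a 
glibc heap-consistency diagnostic, so the abort most likely originates in 
native memory rather than in the Java heap. The sketch below only illustrates 
the exit-code arithmetic; the class and method names are hypothetical, and the 
signal numbers assume Linux on x86.

```java
// Hypothetical helper, not part of Tez or YARN: decodes a shell-style exit code.
class ExitCodeDecoder {

    /** Returns the signal number encoded in the exit code, or -1 if none is encoded. */
    static int signalFromExitCode(int exitCode) {
        // Shells report "terminated by signal n" as 128 + n.
        return (exitCode > 128 && exitCode < 256) ? exitCode - 128 : -1;
    }

    /** Maps a few common Linux (x86) signal numbers to their names. */
    static String describe(int exitCode) {
        int signal = signalFromExitCode(exitCode);
        switch (signal) {
            case -1: return "exit code " + exitCode + " (no signal encoded)";
            case 6:  return "SIGABRT";
            case 9:  return "SIGKILL";
            case 11: return "SIGSEGV";
            case 15: return "SIGTERM";
            default: return "signal " + signal;
        }
    }

    public static void main(String[] args) {
        // 134 is the container exit code reported in the diagnostics.
        System.out.println(describe(134));
    }
}
```

A SIGKILL (exit code 137) would instead point toward the container being 
killed from outside, for example for exceeding memory limits, which is not 
what this log shows.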

 

A private patch that reverts the changes from TEZ-4135 and related commits 
fixed this issue. 

The issue does not reproduce with around 100 GB of data. We are trying to 
reproduce it in our test environment and have been unsuccessful so far, but 
it always occurs in the customer's environment. The user was unable to share 
more logs. I will add more logs once I am able to reproduce it in my 
environment. Opening this ticket now to get initial ideas from the community.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
