[jira] [Comment Edited] (TEZ-4507) Task failure due to memory issues in 0.10+

Jira Tue, 01 Aug 2023 23:57:06 -0700


    [ 
https://issues.apache.org/jira/browse/TEZ-4507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17750146#comment-17750146
 ]


László Bodor edited comment on TEZ-4507 at 8/2/23 6:56 AM:
-----------------------------------------------------------

thanks for reporting this [~ankur_raj_134]
this malloc/free message is confusing in container logs, I thought in case of a 
low-level error it creates a dump, so I cannot really tell for the first sight 
what's happening there, some notes:
1. I guess it's hive that does IOW, isn't it?
2. which java version is the customer using? is different from yours?
3. there is a trace message 
[here|https://github.com/apache/tez/commit/28cf7dcd3b5cd422755e61b21848c31397c157d7#diff-85b3e25937d042db2d58b098d14a4d0db57e2bc4e44e707c68746da290994048R875]
 for the buffer size, could they enable trace logging in container to see if 
the problem is related to a particular sized compressed payload?
4. which compression are they using? can they try if changing the compression 
algorithm helps?


was (Author: abstractdog):
thanks for reporting this [~ankur_raj_134]
this malloc/free message is confusing in container logs, I thought in case of a 
low-level error it creates a dump, so I cannot really tell for the first sight 
what's happening there, 2 notes:
1. I guess it's hive that does IOW, isn't it?
2. which java version is the customer using? is different from yours?
3. there is a trace message 
[here|https://github.com/apache/tez/commit/28cf7dcd3b5cd422755e61b21848c31397c157d7#diff-85b3e25937d042db2d58b098d14a4d0db57e2bc4e44e707c68746da290994048R875]
 for the buffer size, could they enable trace logging in container to see if 
the problem is related to a particular sized compressed payload?
4. which compression are they using? can they try if changing the compression 
algorithm helps?

> Task failure due to memory issues in 0.10+
> ------------------------------------------
>
>                 Key: TEZ-4507
>                 URL: https://issues.apache.org/jira/browse/TEZ-4507
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.10.0, 0.10.1, 0.10.2
>            Reporter: Ankur Raj
>            Priority: Major
>
> This issue is seen on clusters which have tez-0.10.2 branch. on running 
> insert overwrite queries on huge amount of data; user in this case was using 
> insert overwrite on 9 TB data. 
> *Reduce tasks fail to read shuffle data and Hive query ultimately fails with 
> corrupted memory errors.*
> Logs:
>  
> {code:java}
> Map 1: 1392(+495)/1922 Reducer 2: 0(+0,-67)/78414 Map 1: 1392(+490)/1922 
> Reducer 2: 0(+0,-67)/78414 Map 1: 1402(+0)/1922 Reducer 2: 0(+0,-67)/78414 
> Status: Failed Vertex failed, vertexName=Reducer 2, 
> vertexId=vertex_1686949228586_0003_1_01, diagnostics=[Task failed, 
> taskId=task_1686949228586_0003_1_01_002253, diagnostics=[TaskAttempt 0 
> failed, info=[Container container_1686949228586_0003_01_000050 finished with 
> diagnostics set to [Container failed, exitCode=134. [2023-06-16 
> 21:54:04.463]Exception from container-launch. Container id: 
> container_1686949228586_0003_01_000050 Exit code: 134 [2023-06-16 
> 21:54:04.468]Container exited with a non-zero exit code 134. Error file: 
> prelaunch.err. Last 4096 bytes of prelaunch.err : /bin/bash: line 1: 28148 
> Aborted /usr/lib/jvm/java-1.8.0/bin/java -Xmx14745m -server 
> -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN 
> -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator 
> -Dlog4j.configuration=tez-container-log4j.properties 
> -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1686949228586_0003/container_1686949228586_0003_01_000050
>  -Dtez.root.logger=INFO,CLA 
> -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator 
> -Dlog4j.configuration=tez-container-log4j.properties 
> -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1686949228586_0003/container_1686949228586_0003_01_000050
>  
> -Djava.io.tmpdir=/mnt/yarn/usercache/hadoop/appcache/application_1686949228586_0003/container_1686949228586_0003_01_000050/tmp
>  org.apache.tez.runtime.task.TezChild ip-10-1-14-249.ec2.internal 35871 
> container_1686949228586_0003_01_000050 application_1686949228586_0003 1 > 
> /var/log/hadoop-yarn/containers/application_1686949228586_0003/container_1686949228586_0003_01_000050/stdout
>  2> 
> /var/log/hadoop-yarn/containers/application_1686949228586_0003/container_1686949228586_0003_01_000050/stderr
>  Last 4096 bytes of stderr : 2023-06-16 21:03:28 Starting to run new task 
> attempt: attempt_1686949228586_0003_1_00_000062_0 2023-06-16 21:29:21 
> Completed running task attempt: attempt_1686949228586_0003_1_00_000062_0 
> 2023-06-16 21:29:22 Starting to run new task attempt: 
> attempt_1686949228586_0003_1_00_001083_0 2023-06-16 21:43:06 Completed 
> running task attempt: attempt_1686949228586_0003_1_00_001083_0 2023-06-16 
> 21:43:07 Starting to run new task attempt: 
> attempt_1686949228586_0003_1_00_001424_0 2023-06-16 21:53:57 Completed 
> running task attempt: attempt_1686949228586_0003_1_00_001424_0 2023-06-16 
> 21:53:58 Starting to run new task attempt: 
> attempt_1686949228586_0003_1_01_002253_0 [2023-06-16 21:54:04.469]Container 
> exited with a non-zero exit code 134. Error file: prelaunch.err. Last 4096 
> bytes of prelaunch.err : /bin/bash: line 1: 28148 Aborted 
> /usr/lib/jvm/java-1.8.0/bin/java -Xmx14745m -server 
> -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN 
> -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator 
> -Dlog4j.configuration=tez-container-log4j.properties 
> -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1686949228586_0003/container_1686949228586_0003_01_000050
>  -Dtez.root.logger=INFO,CLA 
> -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator 
> -Dlog4j.configuration=tez-container-log4j.properties 
> -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1686949228586_0003/container_1686949228586_0003_01_000050
>  
> -Djava.io.tmpdir=/mnt/yarn/usercache/hadoop/appcache/application_1686949228586_0003/container_1686949228586_0003_01_000050/tmp
>  org.apache.tez.runtime.task.TezChild ip-10-1-14-249.ec2.internal 35871 
> container_1686949228586_0003_01_000050 application_1686949228586_0003 1 > 
> /var/log/hadoop-yarn/containers/application_1686949228586_0003/container_1686949228586_0003_01_000050/stdout
>  2> 
> /var/log/hadoop-yarn/containers/application_1686949228586_0003/container_1686949228586_0003_01_000050/stderr
>  Last 4096 bytes of stderr : 2023-06-16 21:03:28 Starting to run new task 
> attempt: attempt_1686949228586_0003_1_00_000062_0 2023-06-16 21:29:21 
> Completed running task attempt: attempt_1686949228586_0003_1_00_000062_0 
> 2023-06-16 21:29:22 Starting to run new task attempt: 
> attempt_1686949228586_0003_1_00_001083_0 2023-06-16 21:43:06 Completed 
> running task attempt: attempt_1686949228586_0003_1_00_001083_0
>  
> {code}
> Container logs :
>  
> {code:java}
> [2023-06-22 12:50:56.419]Container exited with a non-zero exit code 134. 
> Error file: prelaunch.err.
> Last 4096 bytes of prelaunch.err :
> /bin/bash: line 1: 33001 Aborted                 
> /usr/lib/jvm/java-1.8.0/bin/java -Xmx16384m -server 
> -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN 
> -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator 
> -Dlog4j.configuration=tez-container-log4j.properties 
> -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1687437498460_0002/container_1687437498460_0002_01_000099
>  -Dtez.root.logger=INFO,CLA 
> -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator 
> -Dlog4j.configuration=tez-container-log4j.properties 
> -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1687437498460_0002/container_1687437498460_0002_01_000099
>  
> -Djava.io.tmpdir=/mnt/yarn/usercache/hadoop/appcache/application_1687437498460_0002/container_1687437498460_0002_01_000099/tmp
>  org.apache.tez.runtime.task.TezChild ip-10-1-15-66.ec2.internal 36193 
> container_1687437498460_0002_01_000099 application_1687437498460_0002 1 > 
> /var/log/hadoop-yarn/containers/application_1687437498460_0002/container_1687437498460_0002_01_000099/stdout
>  2> 
> /var/log/hadoop-yarn/containers/application_1687437498460_0002/container_1687437498460_0002_01_000099/stderr
> Last 4096 bytes of stderr :
> 2023-06-22 12:50:55 Starting to run new task attempt: 
> attempt_1687437498460_0002_1_01_000732_0
> malloc(): memory corruption{code}
>  
>  
> {code:java}
> [2023-06-22 12:51:07.026]Container exited with a non-zero exit code 134. 
> Error file: prelaunch.err.
> Last 4096 bytes of prelaunch.err :
> /bin/bash: line 1: 42026 Aborted                 
> /usr/lib/jvm/java-1.8.0/bin/java -Xmx16384m -server 
> -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN 
> -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator 
> -Dlog4j.configuration=tez-container-log4j.properties 
> -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1687437498460_0002/container_1687437498460_0002_01_000265
>  -Dtez.root.logger=INFO,CLA 
> -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator 
> -Dlog4j.configuration=tez-container-log4j.properties 
> -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1687437498460_0002/container_1687437498460_0002_01_000265
>  
> -Djava.io.tmpdir=/mnt/yarn/usercache/hadoop/appcache/application_1687437498460_0002/container_1687437498460_0002_01_000265/tmp
>  org.apache.tez.runtime.task.TezChild ip-10-1-15-66.ec2.internal 36193 
> container_1687437498460_0002_01_000265 application_1687437498460_0002 1 > 
> /var/log/hadoop-yarn/containers/application_1687437498460_0002/container_1687437498460_0002_01_000265/stdout
>  2> 
> /var/log/hadoop-yarn/containers/application_1687437498460_0002/container_1687437498460_0002_01_000265/stderr
> Last 4096 bytes of stderr :
> 2023-06-22 12:51:06 Starting to run new task attempt: 
> attempt_1687437498460_0002_1_01_000732_2
> free(): invalid next size (normal){code}
>  
>  
>  
> {code:java}
> [2023-06-22 12:50:56.401]Container exited with a non-zero exit code 134. 
> Error file: prelaunch.err.
> Last 4096 bytes of prelaunch.err :
> /bin/bash: line 1: 35840 Aborted                 
> /usr/lib/jvm/java-1.8.0/bin/java -Xmx16384m -server 
> -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN 
> -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator 
> -Dlog4j.configuration=tez-container-log4j.properties 
> -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1687437498460_0002/container_1687437498460_0002_01_000105
>  -Dtez.root.logger=INFO,CLA 
> -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator 
> -Dlog4j.configuration=tez-container-log4j.properties 
> -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1687437498460_0002/container_1687437498460_0002_01_000105
>  
> -Djava.io.tmpdir=/mnt/yarn/usercache/hadoop/appcache/application_1687437498460_0002/container_1687437498460_0002_01_000105/tmp
>  org.apache.tez.runtime.task.TezChild ip-10-1-15-66.ec2.internal 36193 
> container_1687437498460_0002_01_000105 application_1687437498460_0002 1 > 
> /var/log/hadoop-yarn/containers/application_1687437498460_0002/container_1687437498460_0002_01_000105/stdout
>  2> 
> /var/log/hadoop-yarn/containers/application_1687437498460_0002/container_1687437498460_0002_01_000105/stderr
> Last 4096 bytes of stderr :
> 2023-06-22 12:50:55 Starting to run new task attempt: 
> attempt_1687437498460_0002_1_01_000857_0
> java: malloc.c:4032: _int_malloc: Assertion `(unsigned long) (size) >= 
> (unsigned long) (nb)' failed.
>  
> {code}
> This seems to point to memory corruption.
> A private patch was created by reverting changes in TEZ-4135 and related 
> commits, which fixed this issue. 
> This is not reproduced on using around 100 GB of data. We are trying to 
> reproduce this in our test environment, and have been unsuccesful so far. But 
> issue always occurs in customer's env. User was unable to share more logs. I 
> will add more logs once i am able to reproduce it in my environment. Opening 
> this ticket now to get initial ideas from the community.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Comment Edited] (TEZ-4507) Task failure due to memory issues in 0.10+

Reply via email to