[
https://issues.apache.org/jira/browse/TEZ-4507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17750146#comment-17750146
]
László Bodor commented on TEZ-4507:
-----------------------------------
thanks for reporting this [~ankur_raj_134]
this malloc/free message is confusing in container logs, I thought in case of a
low-level error it creates a dump, so I cannot really tell for the first sight
what's happening there, 2 notes:
1. I guess it's hive that does IOW, isn't it?
2. which java version is the customer using? is different from yours?
3. there is a trace message
[here|https://github.com/apache/tez/commit/28cf7dcd3b5cd422755e61b21848c31397c157d7#diff-85b3e25937d042db2d58b098d14a4d0db57e2bc4e44e707c68746da290994048R875]
for the buffer size, could they enable trace logging in container to see if
the problem is related to a particular sized compressed payload?
4. which compression are they using? can they try if changing the compression
algorithm helps?
> Task failure due to memory issues in 0.10+
> ------------------------------------------
>
> Key: TEZ-4507
> URL: https://issues.apache.org/jira/browse/TEZ-4507
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.10.0, 0.10.1, 0.10.2
> Reporter: Ankur Raj
> Priority: Major
>
> This issue is seen on clusters which have tez-0.10.2 branch. on running
> insert overwrite queries on huge amount of data; user in this case was using
> insert overwrite on 9 TB data.
> *Reduce tasks fail to read shuffle data and Hive query ultimately fails with
> corrupted memory errors.*
> Logs:
>
> {code:java}
> Map 1: 1392(+495)/1922 Reducer 2: 0(+0,-67)/78414 Map 1: 1392(+490)/1922
> Reducer 2: 0(+0,-67)/78414 Map 1: 1402(+0)/1922 Reducer 2: 0(+0,-67)/78414
> Status: Failed Vertex failed, vertexName=Reducer 2,
> vertexId=vertex_1686949228586_0003_1_01, diagnostics=[Task failed,
> taskId=task_1686949228586_0003_1_01_002253, diagnostics=[TaskAttempt 0
> failed, info=[Container container_1686949228586_0003_01_000050 finished with
> diagnostics set to [Container failed, exitCode=134. [2023-06-16
> 21:54:04.463]Exception from container-launch. Container id:
> container_1686949228586_0003_01_000050 Exit code: 134 [2023-06-16
> 21:54:04.468]Container exited with a non-zero exit code 134. Error file:
> prelaunch.err. Last 4096 bytes of prelaunch.err : /bin/bash: line 1: 28148
> Aborted /usr/lib/jvm/java-1.8.0/bin/java -Xmx14745m -server
> -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN
> -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator
> -Dlog4j.configuration=tez-container-log4j.properties
> -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1686949228586_0003/container_1686949228586_0003_01_000050
> -Dtez.root.logger=INFO,CLA
> -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator
> -Dlog4j.configuration=tez-container-log4j.properties
> -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1686949228586_0003/container_1686949228586_0003_01_000050
>
> -Djava.io.tmpdir=/mnt/yarn/usercache/hadoop/appcache/application_1686949228586_0003/container_1686949228586_0003_01_000050/tmp
> org.apache.tez.runtime.task.TezChild ip-10-1-14-249.ec2.internal 35871
> container_1686949228586_0003_01_000050 application_1686949228586_0003 1 >
> /var/log/hadoop-yarn/containers/application_1686949228586_0003/container_1686949228586_0003_01_000050/stdout
> 2>
> /var/log/hadoop-yarn/containers/application_1686949228586_0003/container_1686949228586_0003_01_000050/stderr
> Last 4096 bytes of stderr : 2023-06-16 21:03:28 Starting to run new task
> attempt: attempt_1686949228586_0003_1_00_000062_0 2023-06-16 21:29:21
> Completed running task attempt: attempt_1686949228586_0003_1_00_000062_0
> 2023-06-16 21:29:22 Starting to run new task attempt:
> attempt_1686949228586_0003_1_00_001083_0 2023-06-16 21:43:06 Completed
> running task attempt: attempt_1686949228586_0003_1_00_001083_0 2023-06-16
> 21:43:07 Starting to run new task attempt:
> attempt_1686949228586_0003_1_00_001424_0 2023-06-16 21:53:57 Completed
> running task attempt: attempt_1686949228586_0003_1_00_001424_0 2023-06-16
> 21:53:58 Starting to run new task attempt:
> attempt_1686949228586_0003_1_01_002253_0 [2023-06-16 21:54:04.469]Container
> exited with a non-zero exit code 134. Error file: prelaunch.err. Last 4096
> bytes of prelaunch.err : /bin/bash: line 1: 28148 Aborted
> /usr/lib/jvm/java-1.8.0/bin/java -Xmx14745m -server
> -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN
> -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator
> -Dlog4j.configuration=tez-container-log4j.properties
> -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1686949228586_0003/container_1686949228586_0003_01_000050
> -Dtez.root.logger=INFO,CLA
> -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator
> -Dlog4j.configuration=tez-container-log4j.properties
> -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1686949228586_0003/container_1686949228586_0003_01_000050
>
> -Djava.io.tmpdir=/mnt/yarn/usercache/hadoop/appcache/application_1686949228586_0003/container_1686949228586_0003_01_000050/tmp
> org.apache.tez.runtime.task.TezChild ip-10-1-14-249.ec2.internal 35871
> container_1686949228586_0003_01_000050 application_1686949228586_0003 1 >
> /var/log/hadoop-yarn/containers/application_1686949228586_0003/container_1686949228586_0003_01_000050/stdout
> 2>
> /var/log/hadoop-yarn/containers/application_1686949228586_0003/container_1686949228586_0003_01_000050/stderr
> Last 4096 bytes of stderr : 2023-06-16 21:03:28 Starting to run new task
> attempt: attempt_1686949228586_0003_1_00_000062_0 2023-06-16 21:29:21
> Completed running task attempt: attempt_1686949228586_0003_1_00_000062_0
> 2023-06-16 21:29:22 Starting to run new task attempt:
> attempt_1686949228586_0003_1_00_001083_0 2023-06-16 21:43:06 Completed
> running task attempt: attempt_1686949228586_0003_1_00_001083_0
>
> {code}
> Container logs :
>
> {code:java}
> [2023-06-22 12:50:56.419]Container exited with a non-zero exit code 134.
> Error file: prelaunch.err.
> Last 4096 bytes of prelaunch.err :
> /bin/bash: line 1: 33001 Aborted
> /usr/lib/jvm/java-1.8.0/bin/java -Xmx16384m -server
> -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN
> -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator
> -Dlog4j.configuration=tez-container-log4j.properties
> -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1687437498460_0002/container_1687437498460_0002_01_000099
> -Dtez.root.logger=INFO,CLA
> -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator
> -Dlog4j.configuration=tez-container-log4j.properties
> -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1687437498460_0002/container_1687437498460_0002_01_000099
>
> -Djava.io.tmpdir=/mnt/yarn/usercache/hadoop/appcache/application_1687437498460_0002/container_1687437498460_0002_01_000099/tmp
> org.apache.tez.runtime.task.TezChild ip-10-1-15-66.ec2.internal 36193
> container_1687437498460_0002_01_000099 application_1687437498460_0002 1 >
> /var/log/hadoop-yarn/containers/application_1687437498460_0002/container_1687437498460_0002_01_000099/stdout
> 2>
> /var/log/hadoop-yarn/containers/application_1687437498460_0002/container_1687437498460_0002_01_000099/stderr
> Last 4096 bytes of stderr :
> 2023-06-22 12:50:55 Starting to run new task attempt:
> attempt_1687437498460_0002_1_01_000732_0
> malloc(): memory corruption{code}
>
>
> {code:java}
> [2023-06-22 12:51:07.026]Container exited with a non-zero exit code 134.
> Error file: prelaunch.err.
> Last 4096 bytes of prelaunch.err :
> /bin/bash: line 1: 42026 Aborted
> /usr/lib/jvm/java-1.8.0/bin/java -Xmx16384m -server
> -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN
> -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator
> -Dlog4j.configuration=tez-container-log4j.properties
> -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1687437498460_0002/container_1687437498460_0002_01_000265
> -Dtez.root.logger=INFO,CLA
> -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator
> -Dlog4j.configuration=tez-container-log4j.properties
> -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1687437498460_0002/container_1687437498460_0002_01_000265
>
> -Djava.io.tmpdir=/mnt/yarn/usercache/hadoop/appcache/application_1687437498460_0002/container_1687437498460_0002_01_000265/tmp
> org.apache.tez.runtime.task.TezChild ip-10-1-15-66.ec2.internal 36193
> container_1687437498460_0002_01_000265 application_1687437498460_0002 1 >
> /var/log/hadoop-yarn/containers/application_1687437498460_0002/container_1687437498460_0002_01_000265/stdout
> 2>
> /var/log/hadoop-yarn/containers/application_1687437498460_0002/container_1687437498460_0002_01_000265/stderr
> Last 4096 bytes of stderr :
> 2023-06-22 12:51:06 Starting to run new task attempt:
> attempt_1687437498460_0002_1_01_000732_2
> free(): invalid next size (normal){code}
>
>
>
> {code:java}
> [2023-06-22 12:50:56.401]Container exited with a non-zero exit code 134.
> Error file: prelaunch.err.
> Last 4096 bytes of prelaunch.err :
> /bin/bash: line 1: 35840 Aborted
> /usr/lib/jvm/java-1.8.0/bin/java -Xmx16384m -server
> -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN
> -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator
> -Dlog4j.configuration=tez-container-log4j.properties
> -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1687437498460_0002/container_1687437498460_0002_01_000105
> -Dtez.root.logger=INFO,CLA
> -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator
> -Dlog4j.configuration=tez-container-log4j.properties
> -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1687437498460_0002/container_1687437498460_0002_01_000105
>
> -Djava.io.tmpdir=/mnt/yarn/usercache/hadoop/appcache/application_1687437498460_0002/container_1687437498460_0002_01_000105/tmp
> org.apache.tez.runtime.task.TezChild ip-10-1-15-66.ec2.internal 36193
> container_1687437498460_0002_01_000105 application_1687437498460_0002 1 >
> /var/log/hadoop-yarn/containers/application_1687437498460_0002/container_1687437498460_0002_01_000105/stdout
> 2>
> /var/log/hadoop-yarn/containers/application_1687437498460_0002/container_1687437498460_0002_01_000105/stderr
> Last 4096 bytes of stderr :
> 2023-06-22 12:50:55 Starting to run new task attempt:
> attempt_1687437498460_0002_1_01_000857_0
> java: malloc.c:4032: _int_malloc: Assertion `(unsigned long) (size) >=
> (unsigned long) (nb)' failed.
>
> {code}
> This seems to point to memory corruption.
> A private patch was created by reverting changes in TEZ-4135 and related
> commits, which fixed this issue.
> This is not reproduced on using around 100 GB of data. We are trying to
> reproduce this in our test environment, and have been unsuccesful so far. But
> issue always occurs in customer's env. User was unable to share more logs. I
> will add more logs once i am able to reproduce it in my environment. Opening
> this ticket now to get initial ideas from the community.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)