[ https://issues.apache.org/jira/browse/TEZ-4507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17750146#comment-17750146 ]
László Bodor edited comment on TEZ-4507 at 8/2/23 6:56 AM: ----------------------------------------------------------- thanks for reporting this [~ankur_raj_134] this malloc/free message is confusing in container logs, I thought in case of a low-level error it creates a dump, so I cannot really tell for the first sight what's happening there, some notes: 1. I guess it's hive that does IOW, isn't it? 2. which java version is the customer using? is different from yours? 3. there is a trace message [here|https://github.com/apache/tez/commit/28cf7dcd3b5cd422755e61b21848c31397c157d7#diff-85b3e25937d042db2d58b098d14a4d0db57e2bc4e44e707c68746da290994048R875] for the buffer size, could they enable trace logging in container to see if the problem is related to a particular sized compressed payload? 4. which compression are they using? can they try if changing the compression algorithm helps? was (Author: abstractdog): thanks for reporting this [~ankur_raj_134] this malloc/free message is confusing in container logs, I thought in case of a low-level error it creates a dump, so I cannot really tell for the first sight what's happening there, 2 notes: 1. I guess it's hive that does IOW, isn't it? 2. which java version is the customer using? is different from yours? 3. there is a trace message [here|https://github.com/apache/tez/commit/28cf7dcd3b5cd422755e61b21848c31397c157d7#diff-85b3e25937d042db2d58b098d14a4d0db57e2bc4e44e707c68746da290994048R875] for the buffer size, could they enable trace logging in container to see if the problem is related to a particular sized compressed payload? 4. which compression are they using? can they try if changing the compression algorithm helps? > Task failure due to memory issues in 0.10+ > ------------------------------------------ > > Key: TEZ-4507 > URL: https://issues.apache.org/jira/browse/TEZ-4507 > Project: Apache Tez > Issue Type: Bug > Affects Versions: 0.10.0, 0.10.1, 0.10.2 > Reporter: Ankur Raj > Priority: Major > > This issue is seen on clusters which have tez-0.10.2 branch. on running > insert overwrite queries on huge amount of data; user in this case was using > insert overwrite on 9 TB data. > *Reduce tasks fail to read shuffle data and Hive query ultimately fails with > corrupted memory errors.* > Logs: > > {code:java} > Map 1: 1392(+495)/1922 Reducer 2: 0(+0,-67)/78414 Map 1: 1392(+490)/1922 > Reducer 2: 0(+0,-67)/78414 Map 1: 1402(+0)/1922 Reducer 2: 0(+0,-67)/78414 > Status: Failed Vertex failed, vertexName=Reducer 2, > vertexId=vertex_1686949228586_0003_1_01, diagnostics=[Task failed, > taskId=task_1686949228586_0003_1_01_002253, diagnostics=[TaskAttempt 0 > failed, info=[Container container_1686949228586_0003_01_000050 finished with > diagnostics set to [Container failed, exitCode=134. [2023-06-16 > 21:54:04.463]Exception from container-launch. Container id: > container_1686949228586_0003_01_000050 Exit code: 134 [2023-06-16 > 21:54:04.468]Container exited with a non-zero exit code 134. Error file: > prelaunch.err. Last 4096 bytes of prelaunch.err : /bin/bash: line 1: 28148 > Aborted /usr/lib/jvm/java-1.8.0/bin/java -Xmx14745m -server > -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN > -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator > -Dlog4j.configuration=tez-container-log4j.properties > -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1686949228586_0003/container_1686949228586_0003_01_000050 > -Dtez.root.logger=INFO,CLA > -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator > -Dlog4j.configuration=tez-container-log4j.properties > -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1686949228586_0003/container_1686949228586_0003_01_000050 > > -Djava.io.tmpdir=/mnt/yarn/usercache/hadoop/appcache/application_1686949228586_0003/container_1686949228586_0003_01_000050/tmp > org.apache.tez.runtime.task.TezChild ip-10-1-14-249.ec2.internal 35871 > container_1686949228586_0003_01_000050 application_1686949228586_0003 1 > > /var/log/hadoop-yarn/containers/application_1686949228586_0003/container_1686949228586_0003_01_000050/stdout > 2> > /var/log/hadoop-yarn/containers/application_1686949228586_0003/container_1686949228586_0003_01_000050/stderr > Last 4096 bytes of stderr : 2023-06-16 21:03:28 Starting to run new task > attempt: attempt_1686949228586_0003_1_00_000062_0 2023-06-16 21:29:21 > Completed running task attempt: attempt_1686949228586_0003_1_00_000062_0 > 2023-06-16 21:29:22 Starting to run new task attempt: > attempt_1686949228586_0003_1_00_001083_0 2023-06-16 21:43:06 Completed > running task attempt: attempt_1686949228586_0003_1_00_001083_0 2023-06-16 > 21:43:07 Starting to run new task attempt: > attempt_1686949228586_0003_1_00_001424_0 2023-06-16 21:53:57 Completed > running task attempt: attempt_1686949228586_0003_1_00_001424_0 2023-06-16 > 21:53:58 Starting to run new task attempt: > attempt_1686949228586_0003_1_01_002253_0 [2023-06-16 21:54:04.469]Container > exited with a non-zero exit code 134. Error file: prelaunch.err. Last 4096 > bytes of prelaunch.err : /bin/bash: line 1: 28148 Aborted > /usr/lib/jvm/java-1.8.0/bin/java -Xmx14745m -server > -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN > -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator > -Dlog4j.configuration=tez-container-log4j.properties > -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1686949228586_0003/container_1686949228586_0003_01_000050 > -Dtez.root.logger=INFO,CLA > -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator > -Dlog4j.configuration=tez-container-log4j.properties > -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1686949228586_0003/container_1686949228586_0003_01_000050 > > -Djava.io.tmpdir=/mnt/yarn/usercache/hadoop/appcache/application_1686949228586_0003/container_1686949228586_0003_01_000050/tmp > org.apache.tez.runtime.task.TezChild ip-10-1-14-249.ec2.internal 35871 > container_1686949228586_0003_01_000050 application_1686949228586_0003 1 > > /var/log/hadoop-yarn/containers/application_1686949228586_0003/container_1686949228586_0003_01_000050/stdout > 2> > /var/log/hadoop-yarn/containers/application_1686949228586_0003/container_1686949228586_0003_01_000050/stderr > Last 4096 bytes of stderr : 2023-06-16 21:03:28 Starting to run new task > attempt: attempt_1686949228586_0003_1_00_000062_0 2023-06-16 21:29:21 > Completed running task attempt: attempt_1686949228586_0003_1_00_000062_0 > 2023-06-16 21:29:22 Starting to run new task attempt: > attempt_1686949228586_0003_1_00_001083_0 2023-06-16 21:43:06 Completed > running task attempt: attempt_1686949228586_0003_1_00_001083_0 > > {code} > Container logs : > > {code:java} > [2023-06-22 12:50:56.419]Container exited with a non-zero exit code 134. > Error file: prelaunch.err. > Last 4096 bytes of prelaunch.err : > /bin/bash: line 1: 33001 Aborted > /usr/lib/jvm/java-1.8.0/bin/java -Xmx16384m -server > -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN > -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator > -Dlog4j.configuration=tez-container-log4j.properties > -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1687437498460_0002/container_1687437498460_0002_01_000099 > -Dtez.root.logger=INFO,CLA > -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator > -Dlog4j.configuration=tez-container-log4j.properties > -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1687437498460_0002/container_1687437498460_0002_01_000099 > > -Djava.io.tmpdir=/mnt/yarn/usercache/hadoop/appcache/application_1687437498460_0002/container_1687437498460_0002_01_000099/tmp > org.apache.tez.runtime.task.TezChild ip-10-1-15-66.ec2.internal 36193 > container_1687437498460_0002_01_000099 application_1687437498460_0002 1 > > /var/log/hadoop-yarn/containers/application_1687437498460_0002/container_1687437498460_0002_01_000099/stdout > 2> > /var/log/hadoop-yarn/containers/application_1687437498460_0002/container_1687437498460_0002_01_000099/stderr > Last 4096 bytes of stderr : > 2023-06-22 12:50:55 Starting to run new task attempt: > attempt_1687437498460_0002_1_01_000732_0 > malloc(): memory corruption{code} > > > {code:java} > [2023-06-22 12:51:07.026]Container exited with a non-zero exit code 134. > Error file: prelaunch.err. > Last 4096 bytes of prelaunch.err : > /bin/bash: line 1: 42026 Aborted > /usr/lib/jvm/java-1.8.0/bin/java -Xmx16384m -server > -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN > -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator > -Dlog4j.configuration=tez-container-log4j.properties > -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1687437498460_0002/container_1687437498460_0002_01_000265 > -Dtez.root.logger=INFO,CLA > -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator > -Dlog4j.configuration=tez-container-log4j.properties > -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1687437498460_0002/container_1687437498460_0002_01_000265 > > -Djava.io.tmpdir=/mnt/yarn/usercache/hadoop/appcache/application_1687437498460_0002/container_1687437498460_0002_01_000265/tmp > org.apache.tez.runtime.task.TezChild ip-10-1-15-66.ec2.internal 36193 > container_1687437498460_0002_01_000265 application_1687437498460_0002 1 > > /var/log/hadoop-yarn/containers/application_1687437498460_0002/container_1687437498460_0002_01_000265/stdout > 2> > /var/log/hadoop-yarn/containers/application_1687437498460_0002/container_1687437498460_0002_01_000265/stderr > Last 4096 bytes of stderr : > 2023-06-22 12:51:06 Starting to run new task attempt: > attempt_1687437498460_0002_1_01_000732_2 > free(): invalid next size (normal){code} > > > > {code:java} > [2023-06-22 12:50:56.401]Container exited with a non-zero exit code 134. > Error file: prelaunch.err. > Last 4096 bytes of prelaunch.err : > /bin/bash: line 1: 35840 Aborted > /usr/lib/jvm/java-1.8.0/bin/java -Xmx16384m -server > -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN > -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator > -Dlog4j.configuration=tez-container-log4j.properties > -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1687437498460_0002/container_1687437498460_0002_01_000105 > -Dtez.root.logger=INFO,CLA > -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator > -Dlog4j.configuration=tez-container-log4j.properties > -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1687437498460_0002/container_1687437498460_0002_01_000105 > > -Djava.io.tmpdir=/mnt/yarn/usercache/hadoop/appcache/application_1687437498460_0002/container_1687437498460_0002_01_000105/tmp > org.apache.tez.runtime.task.TezChild ip-10-1-15-66.ec2.internal 36193 > container_1687437498460_0002_01_000105 application_1687437498460_0002 1 > > /var/log/hadoop-yarn/containers/application_1687437498460_0002/container_1687437498460_0002_01_000105/stdout > 2> > /var/log/hadoop-yarn/containers/application_1687437498460_0002/container_1687437498460_0002_01_000105/stderr > Last 4096 bytes of stderr : > 2023-06-22 12:50:55 Starting to run new task attempt: > attempt_1687437498460_0002_1_01_000857_0 > java: malloc.c:4032: _int_malloc: Assertion `(unsigned long) (size) >= > (unsigned long) (nb)' failed. > > {code} > This seems to point to memory corruption. > A private patch was created by reverting changes in TEZ-4135 and related > commits, which fixed this issue. > This is not reproduced on using around 100 GB of data. We are trying to > reproduce this in our test environment, and have been unsuccesful so far. But > issue always occurs in customer's env. User was unable to share more logs. I > will add more logs once i am able to reproduce it in my environment. Opening > this ticket now to get initial ideas from the community. -- This message was sent by Atlassian Jira (v8.20.10#820010)