[ https://issues.apache.org/jira/browse/IMPALA-10669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17326846#comment-17326846 ]

Laszlo Gaal commented on IMPALA-10669:
--------------------------------------

Workaround idea: give the initial build container (which performs the Impala
build and dataload) less memory than the complete machine has.
Dataload runs without problems on an r5.4xlarge with 128GB RAM, so let's try
that: use "-m 128g" when creating the build container.

> Loading nested ORC data is flaky during Docker-based tests
> ----------------------------------------------------------
>
>                 Key: IMPALA-10669
>                 URL: https://issues.apache.org/jira/browse/IMPALA-10669
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Infrastructure
>    Affects Versions: Impala 4.0
>            Reporter: Laszlo Gaal
>            Assignee: Laszlo Gaal
>            Priority: Major
>
> Docker-based tests (using {{docker/test-with-docker.py}}) often fail in the
> dataload phase when trying to load ORC tables with complex types. The failure
> happens quite often (in at least half of the runs), and when it happens,
> the failure pattern is consistent: a Tez container always overruns its
> allotted memory.
> The signature is:
> {code}
> 2021-04-18 13:32:19.551921 [2021-04-18 13:31:51.355]Container killed on request. Exit code is 143
> 2021-04-18 13:32:19.551966 [2021-04-18 13:31:51.356]Container exited with a non-zero exit code 143.
> 2021-04-18 13:32:19.552181 ]], TaskAttempt 1 failed, info=[Container container_1618776748992_0039_01_000003 finished with diagnostics set to [Container failed, exitCode=-104. [2021-04-18 13:32:00.379]Container [pid=11530,containerID=container_1618776748992_0039_01_000003] is running 2785280B beyond the 'PHYSICAL' memory limit. Current usage: 1.0 GB of 1 GB physical memory used; 2.6 GB of 2.1 GB virtual memory used. Killing container.
> 2021-04-18 13:32:19.552224 Dump of the process-tree for container_1618776748992_0039_01_000003 :
> 2021-04-18 13:32:19.552298    |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
> 2021-04-18 13:32:19.552753    |- 11540 11530 11530 11530 (java) 2048 85 2761297920 262152 /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.282.b08-2.el8_3.x86_64/bin/java -Xmx819m -server -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=/home/impdev/Impala/testdata/cluster/cdh7/node-1/var/log/hadoop-yarn/containers/application_1618776748992_0039/container_1618776748992_0039_01_000003 -Dtez.root.logger=INFO,CLA -Djava.io.tmpdir=/home/impdev/Impala/testdata/cluster/cdh7/node-1/var/lib/hadoop-yarn/cache/impdev/nm-local-dir/usercache/impdev/appcache/application_1618776748992_0039/container_1618776748992_0039_01_000003/tmp org.apache.tez.runtime.task.TezChild localhost 38999 container_1618776748992_0039_01_000003 application_1618776748992_0039 1
> 2021-04-18 13:32:19.553375    |- 11530 11528 11530 11530 (bash) 0 0 10010624 672 /bin/bash -c /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.282.b08-2.el8_3.x86_64/bin/java  -Xmx819m -server -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN  -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=/home/impdev/Impala/testdata/cluster/cdh7/node-1/var/log/hadoop-yarn/containers/application_1618776748992_0039/container_1618776748992_0039_01_000003 -Dtez.root.logger=INFO,CLA  -Djava.io.tmpdir=/home/impdev/Impala/testdata/cluster/cdh7/node-1/var/lib/hadoop-yarn/cache/impdev/nm-local-dir/usercache/impdev/appcache/application_1618776748992_0039/container_1618776748992_0039_01_000003/tmp org.apache.tez.runtime.task.TezChild localhost 38999 container_1618776748992_0039_01_000003 application_1618776748992_0039 1 1>/home/impdev/Impala/testdata/cluster/cdh7/node-1/var/log/hadoop-yarn/containers/application_1618776748992_0039/container_1618776748992_0039_01_000003/stdout 2>/home/impdev/Impala/testdata/cluster/cdh7/node-1/var/log/hadoop-yarn/containers/application_1618776748992_0039/container_1618776748992_0039_01_000003/stderr
> {code}
> The failure has only been seen on AWS m5.12xl instances so far, which have 
> 192GB of RAM, all of which is available to the initial container doing the 
> compile/link and dataload phases of a test run.
> The same code runs with no problems on m5.4xl (64GB RAM) and r5.4xl (128GB 
> RAM) instances during other build jobs.
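The figures in the kill diagnostic can be re-derived from the process-tree dump, which helps confirm that the container really was only marginally over its physical limit. The sketch below sums the resident pages of both processes in the container (the java child and its bash wrapper) and compares the total against a 1 GiB limit. It is a minimal illustration, not YARN's own code, and rests on assumptions that the log does not state explicitly: 4 KiB pages, "1 GB" meaning 1 GiB, YARN's default vmem-to-pmem ratio of 2.1, and Tez's default heap fraction of 0.8 as the source of -Xmx819m.

```python
# Sketch: re-derive the YARN container-kill numbers from the process-tree dump.
# Assumptions (not stated in the log): 4 KiB pages, a 1 GiB container limit,
# a vmem-to-pmem ratio of 2.1 and a Tez heap fraction of 0.8.
PAGE_SIZE = 4096                 # assumed bytes per RSS page (x86_64 default)
GIB = 1024 ** 3
MIB = 1024 ** 2

# (name, VMEM_USAGE in bytes, RSSMEM_USAGE in pages), copied from the dump
PROCESS_TREE = [
    ("java", 2761297920, 262152),
    ("bash", 10010624, 672),
]

CONTAINER_PMEM_LIMIT = 1 * GIB   # "1 GB physical memory" in the diagnostics
VMEM_PMEM_RATIO = 2.1            # assumed yarn.nodemanager.vmem-pmem-ratio
HEAP_FRACTION = 0.8              # assumed tez.container.max.java.heap.fraction


def physical_usage(tree):
    """Sum the resident memory of every process in the container, in bytes."""
    return sum(rss_pages * PAGE_SIZE for _, _, rss_pages in tree)


def virtual_usage(tree):
    """Sum the virtual memory of every process in the container, in bytes."""
    return sum(vmem for _, vmem, _ in tree)


pmem = physical_usage(PROCESS_TREE)
vmem = virtual_usage(PROCESS_TREE)
overrun = pmem - CONTAINER_PMEM_LIMIT
vmem_limit = CONTAINER_PMEM_LIMIT * VMEM_PMEM_RATIO
heap_mb = int(CONTAINER_PMEM_LIMIT // MIB * HEAP_FRACTION)

print(f"physical: {pmem} B, over limit by {overrun} B")
print(f"virtual: {vmem / GIB:.1f} GB of {vmem_limit / GIB:.1f} GB")
print(f"implied -Xmx: {heap_mb}m")
```

Under these assumptions the computed overrun matches the reported 2785280B exactly, and the virtual figures round to the reported 2.6 GB of 2.1 GB. With an 819 MiB heap, roughly 205 MiB of the 1 GiB container remain for the JVM's non-heap memory and the bash wrapper.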



--
This message was sent by Atlassian Jira
(v8.3.4#803005)