[ https://issues.apache.org/jira/browse/TEZ-4442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17581807#comment-17581807 ]
László Bodor commented on TEZ-4442:
-----------------------------------

Hi [~AuthurWang], thanks for reporting this (CDP 7.1.7 has very recent tez patches backported, so it looks more like a tez 0.10.x version, but this is not important at the moment).

In the application logs I can see lots of:
{code}
2022-08-08 10:09:20,927 [ERROR] [main] |task.TezChild|: Error fetching new work for container container_e34_1659706606596_0047_01_000185
org.apache.hadoop.ipc.RemoteException(java.lang.OutOfMemoryError): java.lang.OutOfMemoryError: Java heap space
        at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1562)
        at org.apache.hadoop.ipc.Client.call(Client.java:1508)
        at org.apache.hadoop.ipc.Client.call(Client.java:1405)
        at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:251)
        at com.sun.proxy.$Proxy8.getTask(Unknown Source)
        at org.apache.tez.runtime.task.ContainerReporter.callInternal(ContainerReporter.java:58)
        at org.apache.tez.runtime.task.ContainerReporter.callInternal(ContainerReporter.java:36)
        at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
        at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
        at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:69)
        at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
{code}

The exception is thrown while getting a new task, so it can be related to the payload itself. This is not necessarily a tez issue (it can also be related to the hive + tez integration). I recommend adding heap dump creation parameters for the OOM case to the task options:
{code}
tez.task.launch.cmd-opts
{code}

From a heap dump it can be clearly seen what exactly occupies most of the memory.
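A minimal sketch of what that could look like, using the standard HotSpot flags (the dump path is only an example, and a {{set}} like this replaces any existing value of the property, so append the flags to whatever opts you already use):
{code}
set tez.task.launch.cmd-opts=-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp;
{code}
When a task then dies with an OutOfMemoryError, the JVM writes a java_pid<pid>.hprof file into the given directory on the node, which you can open with a heap analyzer such as Eclipse MAT.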
> tez unable to control the memory size when UDF occupies 100MB memory
> ---------------------------------------------------------------------
>
>                 Key: TEZ-4442
>                 URL: https://issues.apache.org/jira/browse/TEZ-4442
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.9.1
>        Environment: CDP7.1.7SP1
> tez 0.9.1
> hive 3.1.3
>            Reporter: Authur Wang
>            Priority: Critical
>        Attachments: app.log, application_1659706606596_0047.log.gz, hiveserver2.out, spark-udf-0.0.1-SNAPSHOT.jar, spark-udf-src.zip
>
> We have a UDF which loads about 5 million records into memory, matches the data in memory according to the user's input, and finally returns the output. Each input record of the UDF leads to one output record. Based on heap dump analysis, this UDF occupies about 100MB of memory. The UDF runs stably in hive on MR, hive on spark and native spark, and only needs about 4GB of memory in those setups. However, with the tez engine the task fails even after we raise the memory from 4G to 8G, and even with 12G it still fails with high probability. Why does the tez engine need so much memory compared to MR and spark? Is there a good tuning method to control the amount of memory?
>
> The command is as follows:
>
> beeline -u 'jdbc:hive2://bg21146.hadoop.com:10000/default;principal=hive/[bg21146.hadoop....@bg.com|mailto:bg21146.hadoop....@bg.com]' --hiveconf tez.queue.name=root.000kjb.bdhmgmas_bas -e "
>
> create temporary function get_card_rank as 'com.unionpay.spark.udf.GenericUDFCupsCardMediaProc' using jar 'hdfs:///user/lib/spark-udf-0.0.1-SNAPSHOT.jar';
>
> set tez.am.log.level=debug;
> set tez.am.resource.memory.mb=8192;
> set hive.tez.container.size=8192;
> set tez.task.resource.memory.mb=2048;
> set tez.runtime.io.sort.mb=1200;
> set hive.auto.convert.join.noconditionaltask.size=500000000;
> set tez.runtime.unordered.output.buffer.size-mb=800;
> set tez.grouping.min-size=33554432;
> set tez.grouping.max-size=536870912;
> set hive.tez.auto.reducer.parallelism=true;
> set hive.tez.min.partition.factor=0.25;
> set hive.tez.max.partition.factor=2.0;
> set hive.exec.reducers.bytes.per.reducer=268435456;
> set mapreduce.map.memory.mb=4096;
> set ipc.maximum.response.length=1536000000;
>
> select
>     get_card_rank(ext_pri_acct_no) as ext_card_media_proc_md,
>     count(\*)
> from bs_comdb.tmp_bscom_glhis_ct_settle_dtl_bas_swt a
> where a.hp_settle_dt = '20200910'
> group by get_card_rank(ext_pri_acct_no)
> ;
> "

--
This message was sent by Atlassian Jira
(v8.20.10#820010)