[ https://issues.apache.org/jira/browse/TEZ-4506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
László Bodor updated TEZ-4506: ------------------------------ Description: There was an OOM in a hive application, and in the console, I can see things like this: {code} Error: Error while compiling statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, vertexName=Reducer 3, vertexId=vertex_1686687930454_0003_1_10, diagnostics=[Task failed, taskId=task_1686687930454_0003_1_10_000201, diagnostics=[TaskAttempt 0 failed, info=[Error: Error while running task ( failure ) : java.lang.RuntimeException: Reducer 3 operator initialization failed at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.init(ReduceRecordProcessor.java:265) at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:268) at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:252) at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374) at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:75) at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:62) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1898) at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:62) at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:38) at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36) at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125) at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:69) at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Async Initialization failed. abortRequested=false at org.apache.hadoop.hive.ql.exec.Operator.completeInitialization(Operator.java:464) at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:398) at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:571) at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:523) at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:384) at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:571) at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:523) at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:384) at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.init(ReduceRecordProcessor.java:237) ... 17 more Caused by: java.lang.OutOfMemoryError: Java heap space at org.apache.hadoop.hive.ql.exec.vector.mapjoin.fast.VectorMapJoinFastLongHashTable.expandAndRehash(VectorMapJoinFastLongHashTable.java:166) at org.apache.hadoop.hive.ql.exec.vector.mapjoin.fast.VectorMapJoinFastLongHashTable.add(VectorMapJoinFastLongHashTable.java:100) at org.apache.hadoop.hive.ql.exec.vector.mapjoin.fast.VectorMapJoinFastLongHashTable.adaptPutRow(VectorMapJoinFastLongHashTable.java:91) at org.apache.hadoop.hive.ql.exec.vector.mapjoin.fast.VectorMapJoinFastLongHashMap.putRow(VectorMapJoinFastLongHashMap.java:147) at org.apache.hadoop.hive.ql.exec.vector.mapjoin.fast.VectorMapJoinFastTableContainer.putRow(VectorMapJoinFastTableContainer.java:184) at org.apache.hadoop.hive.ql.exec.vector.mapjoin.fast.VectorMapJoinFastHashTableLoader.load(VectorMapJoinFastHashTableLoader.java:130) at org.apache.hadoop.hive.ql.exec.MapJoinOperator.loadHashTableInternal(MapJoinOperator.java:385) at org.apache.hadoop.hive.ql.exec.MapJoinOperator.loadHashTable(MapJoinOperator.java:454) at org.apache.hadoop.hive.ql.exec.MapJoinOperator.lambda$initializeOp$0(MapJoinOperator.java:238) at org.apache.hadoop.hive.ql.exec.MapJoinOperator$$Lambda$38/1790423335.call(Unknown Source) at org.apache.hadoop.hive.ql.exec.tez.ObjectCache.retrieve(ObjectCache.java:96) at org.apache.hadoop.hive.ql.exec.tez.ObjectCache$1.call(ObjectCache.java:113) at java.util.concurrent.FutureTask.run(FutureTask.java:266) ... 3 more , errorMessage=Cannot recover from this error:java.lang.RuntimeException: Reducer 3 operator initialization failed at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.init(ReduceRecordProcessor.java:265) at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:268) at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:252) at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374) at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:75) at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:62) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1898) at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:62) at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:38) at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36) at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125) at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:69) at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Async Initialization failed. abortRequested=false at org.apache.hadoop.hive.ql.exec.Operator.completeInitialization(Operator.java:464) at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:398) at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:571) at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:523) at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:384) at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:571) at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:523) at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:384) at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.init(ReduceRecordProcessor.java:237) ... 17 more Caused by: java.lang.OutOfMemoryError: Java heap space at org.apache.hadoop.hive.ql.exec.vector.mapjoin.fast.VectorMapJoinFastLongHashTable.expandAndRehash(VectorMapJoinFastLongHashTable.java:166) at org.apache.hadoop.hive.ql.exec.vector.mapjoin.fast.VectorMapJoinFastLongHashTable.add(VectorMapJoinFastLongHashTable.java:100) at org.apache.hadoop.hive.ql.exec.vector.mapjoin.fast.VectorMapJoinFastLongHashTable.adaptPutRow(VectorMapJoinFastLongHashTable.java:91) at org.apache.hadoop.hive.ql.exec.vector.mapjoin.fast.VectorMapJoinFastLongHashMap.putRow(VectorMapJoinFastLongHashMap.java:147) at org.apache.hadoop.hive.ql.exec.vector.mapjoin.fast.VectorMapJoinFastTableContainer.putRow(VectorMapJoinFastTableContainer.java:184) at org.apache.hadoop.hive.ql.exec.vector.mapjoin.fast.VectorMapJoinFastHashTableLoader.load(VectorMapJoinFastHashTableLoader.java:130) at org.apache.hadoop.hive.ql.exec.MapJoinOperator.loadHashTableInternal(MapJoinOperator.java:385) at org.apache.hadoop.hive.ql.exec.MapJoinOperator.loadHashTable(MapJoinOperator.java:454) at org.apache.hadoop.hive.ql.exec.MapJoinOperator.lambda$initializeOp$0(MapJoinOperator.java:238) at org.apache.hadoop.hive.ql.exec.MapJoinOperator$$Lambda$38/1790423335.call(Unknown Source) at org.apache.hadoop.hive.ql.exec.tez.ObjectCache.retrieve(ObjectCache.java:96) at org.apache.hadoop.hive.ql.exec.tez.ObjectCache$1.call(ObjectCache.java:113) at java.util.concurrent.FutureTask.run(FutureTask.java:266) ... 3 more ]], Vertex did not succeed due to OWN_TASK_FAILURE, failedTasks:1 killedTasks:517, Vertex vertex_1686687930454_0003_1_10 [Reducer 3] killed/failed due to:OWN_TASK_FAILURE]Vertex killed, vertexName=Reducer 4, vertexId=vertex_1686687930454_0003_1_11, diagnostics=[Vertex received Kill while in RUNNING state., Vertex did not succeed due to OTHER_VERTEX_FAILURE, failedTasks:0 killedTasks:2, Vertex vertex_1686687930454_0003_1_11 [Reducer 4] killed/failed due to:OTHER_VERTEX_FAILURE]Vertex killed, vertexName=Reducer 5, vertexId=vertex_1686687930454_0003_1_12, diagnostics=[Vertex received Kill while in RUNNING state., Vertex did not succeed due to OTHER_VERTEX_FAILURE, failedTasks:0 killedTasks:1, Vertex vertex_1686687930454_0003_1_12 [Reducer 5] killed/failed due to:OTHER_VERTEX_FAILURE]DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:2 (state=08S01,code=2) Closing: 0: jdbc:hive2://vc0801.halxg.cloudera.com:2181,vc0824.halxg.cloudera.com:2181,vc0922.halxg.cloudera.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2;sslTrustStore=/root/hive_truststore.jks;trustStorePassword=9mMcEG04Ao1DGtG1Gt2AmDZBWYRwKkiU1EaL7kJvJvy;principal=hive/vc0801.halxg.cloudera....@halxg.cloudera.com;ssl=true {code} the problem: even if there are thing like "diagnostics", I cannot see on which node the actual task attempt ran, so I need to dig into the logs, however I know there is -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp configured for the tasks, so with the node info, I can easily go to the node directly and look at /tmp was: There was an OOM in a hive application, and in the console, I can see things like this: {code} Error: Error while compiling statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, vertexName=Reducer 3, vertexId=vertex_1686687930454_0003_1_10, diagnostics=[Task failed, taskId=task_1686687930454_0003_1_10_000201, diagnostics=[TaskAttempt 0 failed, info=[Error: Error while running task ( failure ) : java.lang.RuntimeException: Reducer 3 operator initialization failed at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.init(ReduceRecordProcessor.java:265) at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:268) at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:252) at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374) at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:75) at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:62) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1898) at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:62) at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:38) at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36) at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125) at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:69) at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Async Initialization failed. abortRequested=false at org.apache.hadoop.hive.ql.exec.Operator.completeInitialization(Operator.java:464) at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:398) at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:571) at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:523) at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:384) at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:571) at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:523) at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:384) at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.init(ReduceRecordProcessor.java:237) ... 17 more Caused by: java.lang.OutOfMemoryError: Java heap space at org.apache.hadoop.hive.ql.exec.vector.mapjoin.fast.VectorMapJoinFastLongHashTable.expandAndRehash(VectorMapJoinFastLongHashTable.java:166) at org.apache.hadoop.hive.ql.exec.vector.mapjoin.fast.VectorMapJoinFastLongHashTable.add(VectorMapJoinFastLongHashTable.java:100) at org.apache.hadoop.hive.ql.exec.vector.mapjoin.fast.VectorMapJoinFastLongHashTable.adaptPutRow(VectorMapJoinFastLongHashTable.java:91) at org.apache.hadoop.hive.ql.exec.vector.mapjoin.fast.VectorMapJoinFastLongHashMap.putRow(VectorMapJoinFastLongHashMap.java:147) at org.apache.hadoop.hive.ql.exec.vector.mapjoin.fast.VectorMapJoinFastTableContainer.putRow(VectorMapJoinFastTableContainer.java:184) at org.apache.hadoop.hive.ql.exec.vector.mapjoin.fast.VectorMapJoinFastHashTableLoader.load(VectorMapJoinFastHashTableLoader.java:130) at org.apache.hadoop.hive.ql.exec.MapJoinOperator.loadHashTableInternal(MapJoinOperator.java:385) at org.apache.hadoop.hive.ql.exec.MapJoinOperator.loadHashTable(MapJoinOperator.java:454) at org.apache.hadoop.hive.ql.exec.MapJoinOperator.lambda$initializeOp$0(MapJoinOperator.java:238) at org.apache.hadoop.hive.ql.exec.MapJoinOperator$$Lambda$38/1790423335.call(Unknown Source) at org.apache.hadoop.hive.ql.exec.tez.ObjectCache.retrieve(ObjectCache.java:96) at org.apache.hadoop.hive.ql.exec.tez.ObjectCache$1.call(ObjectCache.java:113) at java.util.concurrent.FutureTask.run(FutureTask.java:266) ... 3 more , errorMessage=Cannot recover from this error:java.lang.RuntimeException: Reducer 3 operator initialization failed at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.init(ReduceRecordProcessor.java:265) at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:268) at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:252) at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374) at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:75) at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:62) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1898) at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:62) at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:38) at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36) at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125) at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:69) at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Async Initialization failed. abortRequested=false at org.apache.hadoop.hive.ql.exec.Operator.completeInitialization(Operator.java:464) at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:398) at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:571) at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:523) at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:384) at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:571) at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:523) at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:384) at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.init(ReduceRecordProcessor.java:237) ... 17 more Caused by: java.lang.OutOfMemoryError: Java heap space at org.apache.hadoop.hive.ql.exec.vector.mapjoin.fast.VectorMapJoinFastLongHashTable.expandAndRehash(VectorMapJoinFastLongHashTable.java:166) at org.apache.hadoop.hive.ql.exec.vector.mapjoin.fast.VectorMapJoinFastLongHashTable.add(VectorMapJoinFastLongHashTable.java:100) at org.apache.hadoop.hive.ql.exec.vector.mapjoin.fast.VectorMapJoinFastLongHashTable.adaptPutRow(VectorMapJoinFastLongHashTable.java:91) at org.apache.hadoop.hive.ql.exec.vector.mapjoin.fast.VectorMapJoinFastLongHashMap.putRow(VectorMapJoinFastLongHashMap.java:147) at org.apache.hadoop.hive.ql.exec.vector.mapjoin.fast.VectorMapJoinFastTableContainer.putRow(VectorMapJoinFastTableContainer.java:184) at org.apache.hadoop.hive.ql.exec.vector.mapjoin.fast.VectorMapJoinFastHashTableLoader.load(VectorMapJoinFastHashTableLoader.java:130) at org.apache.hadoop.hive.ql.exec.MapJoinOperator.loadHashTableInternal(MapJoinOperator.java:385) at org.apache.hadoop.hive.ql.exec.MapJoinOperator.loadHashTable(MapJoinOperator.java:454) at org.apache.hadoop.hive.ql.exec.MapJoinOperator.lambda$initializeOp$0(MapJoinOperator.java:238) at org.apache.hadoop.hive.ql.exec.MapJoinOperator$$Lambda$38/1790423335.call(Unknown Source) at org.apache.hadoop.hive.ql.exec.tez.ObjectCache.retrieve(ObjectCache.java:96) at org.apache.hadoop.hive.ql.exec.tez.ObjectCache$1.call(ObjectCache.java:113) at java.util.concurrent.FutureTask.run(FutureTask.java:266) ... 3 more ]], Vertex did not succeed due to OWN_TASK_FAILURE, failedTasks:1 killedTasks:517, Vertex vertex_1686687930454_0003_1_10 [Reducer 3] killed/failed due to:OWN_TASK_FAILURE]Vertex killed, vertexName=Reducer 4, vertexId=vertex_1686687930454_0003_1_11, diagnostics=[Vertex received Kill while in RUNNING state., Vertex did not succeed due to OTHER_VERTEX_FAILURE, failedTasks:0 killedTasks:2, Vertex vertex_1686687930454_0003_1_11 [Reducer 4] killed/failed due to:OTHER_VERTEX_FAILURE]Vertex killed, vertexName=Reducer 5, vertexId=vertex_1686687930454_0003_1_12, diagnostics=[Vertex received Kill while in RUNNING state., Vertex did not succeed due to OTHER_VERTEX_FAILURE, failedTasks:0 killedTasks:1, Vertex vertex_1686687930454_0003_1_12 [Reducer 5] killed/failed due to:OTHER_VERTEX_FAILURE]DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:2 (state=08S01,code=2) Closing: 0: jdbc:hive2://vc0801.halxg.cloudera.com:2181,vc0824.halxg.cloudera.com:2181,vc0922.halxg.cloudera.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2;sslTrustStore=/root/hive_truststore.jks;trustStorePassword=9mMcEG04Ao1DGtG1Gt2AmDZBWYRwKkiU1EaL7kJvJvy;principal=hive/vc0801.halxg.cloudera....@halxg.cloudera.com;ssl=true {code} the problem here is that even if there is thing like "diagnostics", I cannot see on which node the actual task attempt ran, so I need to dig into the logs, however I know there is -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp configured for the tasks, so with the node info, I can easily go into the node directly and look at /tmp > Report the node of a task attempt failure better > ------------------------------------------------ > > Key: TEZ-4506 > URL: https://issues.apache.org/jira/browse/TEZ-4506 > Project: Apache Tez > Issue Type: Improvement > Reporter: László Bodor > Priority: Major > > There was an OOM in a hive application, and in the console, I can see things > like this: > {code} > Error: Error while compiling statement: FAILED: Execution Error, return code > 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, > vertexName=Reducer 3, vertexId=vertex_1686687930454_0003_1_10, > diagnostics=[Task failed, taskId=task_1686687930454_0003_1_10_000201, > diagnostics=[TaskAttempt 0 failed, info=[Error: Error while running task ( > failure ) : java.lang.RuntimeException: Reducer 3 operator initialization > failed > at > org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.init(ReduceRecordProcessor.java:265) > at > org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:268) > at > org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:252) > at > org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374) > at > org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:75) > at > org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:62) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1898) > at > org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:62) > at > org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:38) > at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36) > at > com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125) > at > com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:69) > at > com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Async > Initialization failed. abortRequested=false > at > org.apache.hadoop.hive.ql.exec.Operator.completeInitialization(Operator.java:464) > at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:398) > at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:571) > at > org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:523) > at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:384) > at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:571) > at > org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:523) > at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:384) > at > org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.init(ReduceRecordProcessor.java:237) > ... 17 more > Caused by: java.lang.OutOfMemoryError: Java heap space > at > org.apache.hadoop.hive.ql.exec.vector.mapjoin.fast.VectorMapJoinFastLongHashTable.expandAndRehash(VectorMapJoinFastLongHashTable.java:166) > at > org.apache.hadoop.hive.ql.exec.vector.mapjoin.fast.VectorMapJoinFastLongHashTable.add(VectorMapJoinFastLongHashTable.java:100) > at > org.apache.hadoop.hive.ql.exec.vector.mapjoin.fast.VectorMapJoinFastLongHashTable.adaptPutRow(VectorMapJoinFastLongHashTable.java:91) > at > org.apache.hadoop.hive.ql.exec.vector.mapjoin.fast.VectorMapJoinFastLongHashMap.putRow(VectorMapJoinFastLongHashMap.java:147) > at > org.apache.hadoop.hive.ql.exec.vector.mapjoin.fast.VectorMapJoinFastTableContainer.putRow(VectorMapJoinFastTableContainer.java:184) > at > org.apache.hadoop.hive.ql.exec.vector.mapjoin.fast.VectorMapJoinFastHashTableLoader.load(VectorMapJoinFastHashTableLoader.java:130) > at > org.apache.hadoop.hive.ql.exec.MapJoinOperator.loadHashTableInternal(MapJoinOperator.java:385) > at > org.apache.hadoop.hive.ql.exec.MapJoinOperator.loadHashTable(MapJoinOperator.java:454) > at > org.apache.hadoop.hive.ql.exec.MapJoinOperator.lambda$initializeOp$0(MapJoinOperator.java:238) > at > org.apache.hadoop.hive.ql.exec.MapJoinOperator$$Lambda$38/1790423335.call(Unknown > Source) > at > org.apache.hadoop.hive.ql.exec.tez.ObjectCache.retrieve(ObjectCache.java:96) > at > org.apache.hadoop.hive.ql.exec.tez.ObjectCache$1.call(ObjectCache.java:113) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > ... 3 more > , errorMessage=Cannot recover from this error:java.lang.RuntimeException: > Reducer 3 operator initialization failed > at > org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.init(ReduceRecordProcessor.java:265) > at > org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:268) > at > org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:252) > at > org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374) > at > org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:75) > at > org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:62) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1898) > at > org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:62) > at > org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:38) > at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36) > at > com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125) > at > com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:69) > at > com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Async > Initialization failed. abortRequested=false > at > org.apache.hadoop.hive.ql.exec.Operator.completeInitialization(Operator.java:464) > at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:398) > at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:571) > at > org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:523) > at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:384) > at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:571) > at > org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:523) > at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:384) > at > org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.init(ReduceRecordProcessor.java:237) > ... 17 more > Caused by: java.lang.OutOfMemoryError: Java heap space > at > org.apache.hadoop.hive.ql.exec.vector.mapjoin.fast.VectorMapJoinFastLongHashTable.expandAndRehash(VectorMapJoinFastLongHashTable.java:166) > at > org.apache.hadoop.hive.ql.exec.vector.mapjoin.fast.VectorMapJoinFastLongHashTable.add(VectorMapJoinFastLongHashTable.java:100) > at > org.apache.hadoop.hive.ql.exec.vector.mapjoin.fast.VectorMapJoinFastLongHashTable.adaptPutRow(VectorMapJoinFastLongHashTable.java:91) > at > org.apache.hadoop.hive.ql.exec.vector.mapjoin.fast.VectorMapJoinFastLongHashMap.putRow(VectorMapJoinFastLongHashMap.java:147) > at > org.apache.hadoop.hive.ql.exec.vector.mapjoin.fast.VectorMapJoinFastTableContainer.putRow(VectorMapJoinFastTableContainer.java:184) > at > org.apache.hadoop.hive.ql.exec.vector.mapjoin.fast.VectorMapJoinFastHashTableLoader.load(VectorMapJoinFastHashTableLoader.java:130) > at > org.apache.hadoop.hive.ql.exec.MapJoinOperator.loadHashTableInternal(MapJoinOperator.java:385) > at > org.apache.hadoop.hive.ql.exec.MapJoinOperator.loadHashTable(MapJoinOperator.java:454) > at > org.apache.hadoop.hive.ql.exec.MapJoinOperator.lambda$initializeOp$0(MapJoinOperator.java:238) > at > org.apache.hadoop.hive.ql.exec.MapJoinOperator$$Lambda$38/1790423335.call(Unknown > Source) > at > org.apache.hadoop.hive.ql.exec.tez.ObjectCache.retrieve(ObjectCache.java:96) > at > org.apache.hadoop.hive.ql.exec.tez.ObjectCache$1.call(ObjectCache.java:113) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > ... 3 more > ]], Vertex did not succeed due to OWN_TASK_FAILURE, failedTasks:1 > killedTasks:517, Vertex vertex_1686687930454_0003_1_10 [Reducer 3] > killed/failed due to:OWN_TASK_FAILURE]Vertex killed, vertexName=Reducer 4, > vertexId=vertex_1686687930454_0003_1_11, diagnostics=[Vertex received Kill > while in RUNNING state., Vertex did not succeed due to OTHER_VERTEX_FAILURE, > failedTasks:0 killedTasks:2, Vertex vertex_1686687930454_0003_1_11 [Reducer > 4] killed/failed due to:OTHER_VERTEX_FAILURE]Vertex killed, > vertexName=Reducer 5, vertexId=vertex_1686687930454_0003_1_12, > diagnostics=[Vertex received Kill while in RUNNING state., Vertex did not > succeed due to OTHER_VERTEX_FAILURE, failedTasks:0 killedTasks:1, Vertex > vertex_1686687930454_0003_1_12 [Reducer 5] killed/failed due > to:OTHER_VERTEX_FAILURE]DAG did not succeed due to VERTEX_FAILURE. > failedVertices:1 killedVertices:2 (state=08S01,code=2) > Closing: 0: > jdbc:hive2://vc0801.halxg.cloudera.com:2181,vc0824.halxg.cloudera.com:2181,vc0922.halxg.cloudera.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2;sslTrustStore=/root/hive_truststore.jks;trustStorePassword=9mMcEG04Ao1DGtG1Gt2AmDZBWYRwKkiU1EaL7kJvJvy;principal=hive/vc0801.halxg.cloudera....@halxg.cloudera.com;ssl=true > {code} > the problem: even if there are thing like "diagnostics", I cannot see on > which node the actual task attempt ran, so I need to dig into the logs, > however I know there is -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp > configured for the tasks, so with the node info, I can easily go to the node > directly and look at /tmp -- This message was sent by Atlassian Jira (v8.20.10#820010)