[ https://issues.apache.org/jira/browse/TEZ-4506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

László Bodor updated TEZ-4506:
------------------------------
    Description: 
There was an OOM in a Hive application, and in the console I can see output 
like this:
{code}
Error: Error while compiling statement: FAILED: Execution Error, return code 2 
from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, 
vertexName=Reducer 3, vertexId=vertex_1686687930454_0003_1_10, 
diagnostics=[Task failed, taskId=task_1686687930454_0003_1_10_000201, 
diagnostics=[TaskAttempt 0 failed, info=[Error: Error while running task ( 
failure ) : java.lang.RuntimeException: Reducer 3 operator initialization failed
        at 
org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.init(ReduceRecordProcessor.java:265)
        at 
org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:268)
        at 
org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:252)
        at 
org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
        at 
org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:75)
        at 
org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:62)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1898)
        at 
org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:62)
        at 
org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:38)
        at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
        at 
com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
        at 
com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:69)
        at 
com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Async 
Initialization failed. abortRequested=false
        at 
org.apache.hadoop.hive.ql.exec.Operator.completeInitialization(Operator.java:464)
        at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:398)
        at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:571)
        at 
org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:523)
        at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:384)
        at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:571)
        at 
org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:523)
        at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:384)
        at 
org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.init(ReduceRecordProcessor.java:237)
        ... 17 more
Caused by: java.lang.OutOfMemoryError: Java heap space
        at 
org.apache.hadoop.hive.ql.exec.vector.mapjoin.fast.VectorMapJoinFastLongHashTable.expandAndRehash(VectorMapJoinFastLongHashTable.java:166)
        at 
org.apache.hadoop.hive.ql.exec.vector.mapjoin.fast.VectorMapJoinFastLongHashTable.add(VectorMapJoinFastLongHashTable.java:100)
        at 
org.apache.hadoop.hive.ql.exec.vector.mapjoin.fast.VectorMapJoinFastLongHashTable.adaptPutRow(VectorMapJoinFastLongHashTable.java:91)
        at 
org.apache.hadoop.hive.ql.exec.vector.mapjoin.fast.VectorMapJoinFastLongHashMap.putRow(VectorMapJoinFastLongHashMap.java:147)
        at 
org.apache.hadoop.hive.ql.exec.vector.mapjoin.fast.VectorMapJoinFastTableContainer.putRow(VectorMapJoinFastTableContainer.java:184)
        at 
org.apache.hadoop.hive.ql.exec.vector.mapjoin.fast.VectorMapJoinFastHashTableLoader.load(VectorMapJoinFastHashTableLoader.java:130)
        at 
org.apache.hadoop.hive.ql.exec.MapJoinOperator.loadHashTableInternal(MapJoinOperator.java:385)
        at 
org.apache.hadoop.hive.ql.exec.MapJoinOperator.loadHashTable(MapJoinOperator.java:454)
        at 
org.apache.hadoop.hive.ql.exec.MapJoinOperator.lambda$initializeOp$0(MapJoinOperator.java:238)
        at 
org.apache.hadoop.hive.ql.exec.MapJoinOperator$$Lambda$38/1790423335.call(Unknown
 Source)
        at 
org.apache.hadoop.hive.ql.exec.tez.ObjectCache.retrieve(ObjectCache.java:96)
        at 
org.apache.hadoop.hive.ql.exec.tez.ObjectCache$1.call(ObjectCache.java:113)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        ... 3 more
, errorMessage=Cannot recover from this error:java.lang.RuntimeException: 
Reducer 3 operator initialization failed
        at 
org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.init(ReduceRecordProcessor.java:265)
        at 
org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:268)
        at 
org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:252)
        at 
org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
        at 
org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:75)
        at 
org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:62)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1898)
        at 
org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:62)
        at 
org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:38)
        at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
        at 
com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
        at 
com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:69)
        at 
com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Async 
Initialization failed. abortRequested=false
        at 
org.apache.hadoop.hive.ql.exec.Operator.completeInitialization(Operator.java:464)
        at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:398)
        at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:571)
        at 
org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:523)
        at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:384)
        at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:571)
        at 
org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:523)
        at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:384)
        at 
org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.init(ReduceRecordProcessor.java:237)
        ... 17 more
Caused by: java.lang.OutOfMemoryError: Java heap space
        at 
org.apache.hadoop.hive.ql.exec.vector.mapjoin.fast.VectorMapJoinFastLongHashTable.expandAndRehash(VectorMapJoinFastLongHashTable.java:166)
        at 
org.apache.hadoop.hive.ql.exec.vector.mapjoin.fast.VectorMapJoinFastLongHashTable.add(VectorMapJoinFastLongHashTable.java:100)
        at 
org.apache.hadoop.hive.ql.exec.vector.mapjoin.fast.VectorMapJoinFastLongHashTable.adaptPutRow(VectorMapJoinFastLongHashTable.java:91)
        at 
org.apache.hadoop.hive.ql.exec.vector.mapjoin.fast.VectorMapJoinFastLongHashMap.putRow(VectorMapJoinFastLongHashMap.java:147)
        at 
org.apache.hadoop.hive.ql.exec.vector.mapjoin.fast.VectorMapJoinFastTableContainer.putRow(VectorMapJoinFastTableContainer.java:184)
        at 
org.apache.hadoop.hive.ql.exec.vector.mapjoin.fast.VectorMapJoinFastHashTableLoader.load(VectorMapJoinFastHashTableLoader.java:130)
        at 
org.apache.hadoop.hive.ql.exec.MapJoinOperator.loadHashTableInternal(MapJoinOperator.java:385)
        at 
org.apache.hadoop.hive.ql.exec.MapJoinOperator.loadHashTable(MapJoinOperator.java:454)
        at 
org.apache.hadoop.hive.ql.exec.MapJoinOperator.lambda$initializeOp$0(MapJoinOperator.java:238)
        at 
org.apache.hadoop.hive.ql.exec.MapJoinOperator$$Lambda$38/1790423335.call(Unknown
 Source)
        at 
org.apache.hadoop.hive.ql.exec.tez.ObjectCache.retrieve(ObjectCache.java:96)
        at 
org.apache.hadoop.hive.ql.exec.tez.ObjectCache$1.call(ObjectCache.java:113)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        ... 3 more
]], Vertex did not succeed due to OWN_TASK_FAILURE, failedTasks:1 
killedTasks:517, Vertex vertex_1686687930454_0003_1_10 [Reducer 3] 
killed/failed due to:OWN_TASK_FAILURE]Vertex killed, vertexName=Reducer 4, 
vertexId=vertex_1686687930454_0003_1_11, diagnostics=[Vertex received Kill 
while in RUNNING state., Vertex did not succeed due to OTHER_VERTEX_FAILURE, 
failedTasks:0 killedTasks:2, Vertex vertex_1686687930454_0003_1_11 [Reducer 4] 
killed/failed due to:OTHER_VERTEX_FAILURE]Vertex killed, vertexName=Reducer 5, 
vertexId=vertex_1686687930454_0003_1_12, diagnostics=[Vertex received Kill 
while in RUNNING state., Vertex did not succeed due to OTHER_VERTEX_FAILURE, 
failedTasks:0 killedTasks:1, Vertex vertex_1686687930454_0003_1_12 [Reducer 5] 
killed/failed due to:OTHER_VERTEX_FAILURE]DAG did not succeed due to 
VERTEX_FAILURE. failedVertices:1 killedVertices:2 (state=08S01,code=2)
Closing: 0: 
jdbc:hive2://vc0801.halxg.cloudera.com:2181,vc0824.halxg.cloudera.com:2181,vc0922.halxg.cloudera.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2;sslTrustStore=/root/hive_truststore.jks;trustStorePassword=9mMcEG04Ao1DGtG1Gt2AmDZBWYRwKkiU1EaL7kJvJvy;principal=hive/vc0801.halxg.cloudera....@halxg.cloudera.com;ssl=true
{code}

The problem: even though the output contains "diagnostics" entries, I cannot 
see which node the failing task attempt actually ran on, so I have to dig into 
the logs. However, I know that -XX:+HeapDumpOnOutOfMemoryError 
-XX:HeapDumpPath=/tmp is configured for the tasks, so with the node info I 
could go to that node directly and look at /tmp.
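The improvement this ticket asks for could look roughly like the sketch below. It is only an illustration with hypothetical names (`withNode` and `nodeHttpAddress` are not the actual Tez API): the idea is simply that the task attempt's diagnostics string carries the host of the container the attempt ran on.

```java
// Hypothetical sketch, not actual Tez code: enrich a task attempt's
// diagnostics with the node its container ran on, so an OOM can be
// traced to a host (and its /tmp heap dump) without digging through logs.
public class AttemptDiagnostics {

    // nodeHttpAddress is assumed to be available from the attempt's
    // assigned container at failure-reporting time.
    static String withNode(String diagnostics, String nodeHttpAddress) {
        return diagnostics + ", nodeId=" + nodeHttpAddress;
    }

    public static void main(String[] args) {
        String d = withNode("TaskAttempt 0 failed", "vc0801.example.com:8042");
        System.out.println(d);
        // -> TaskAttempt 0 failed, nodeId=vc0801.example.com:8042
    }
}
```

With a line like this in the diagnostics, the console output above would directly name the host whose /tmp holds the heap dump.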



> Report the node of a task attempt failure better
> ------------------------------------------------
>
>                 Key: TEZ-4506
>                 URL: https://issues.apache.org/jira/browse/TEZ-4506
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: László Bodor
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
