[ https://issues.apache.org/jira/browse/HIVE-18869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16390546#comment-16390546 ]
Sahil Takiar commented on HIVE-18869:
-------------------------------------

When testing this, this is the error Spark throws when an executor dies due to OOM issues:

{code}
ERROR : Job failed with org.apache.spark.SparkException: Job aborted due to stage failure: Task 452 in stage 18.0 failed 4 times, most recent failure: Lost task 452.3 in stage 18.0 (TID 1532, vc0540.halxg.cloudera.com, executor 65): ExecutorLostFailure (executor 65 exited caused by one of the running tasks) Reason: Container marked as failed: container_1520386678430_0109_01_000068 on host: vc0540.halxg.cloudera.com. Exit status: 143. Diagnostics: [2018-03-07 14:36:02.029]Container killed on request. Exit code is 143
java.util.concurrent.ExecutionException: Exception thrown by job
	at org.apache.spark.JavaFutureActionWrapper.getImpl(FutureAction.scala:272)
	at org.apache.spark.JavaFutureActionWrapper.get(FutureAction.scala:277)
	at org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:364)
	at org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:325)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 452 in stage 18.0 failed 4 times, most recent failure: Lost task 452.3 in stage 18.0 (TID 1532, vc0540.halxg.cloudera.com, executor 65): ExecutorLostFailure (executor 65 exited caused by one of the running tasks) Reason: Container marked as failed: container_1520386678430_0109_01_000068 on host: vc0540.halxg.cloudera.com. Exit status: 143. Diagnostics: [2018-03-07 14:36:02.029]Container killed on request. Exit code is 143
[2018-03-07 14:36:02.029]Container exited with a non-zero exit code 143.
[2018-03-07 14:36:02.029]Killed by external signal
{code}

Might be good to differentiate {{ExecutorLost}} failures too.

> Improve SparkTask OOM Error Parsing Logic
> -----------------------------------------
>
> Key: HIVE-18869
> URL: https://issues.apache.org/jira/browse/HIVE-18869
> Project: Hive
> Issue Type: Sub-task
> Components: Spark
> Reporter: Sahil Takiar
> Priority: Major
>
> The method {{SparkTask#isOOMError}} parses a stack trace to check whether it is due to an OOM error. A few improvements could be made:
> * Differentiate between driver OOM and task OOM
> * The string {{Container killed by YARN for exceeding memory limits}} is printed when a container exceeds its memory limits, but Spark tasks can OOM for other reasons, such as {{GC overhead limit exceeded}}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
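The classification suggested in the issue could be sketched roughly as below. This is a hypothetical helper, not the actual {{SparkTask#isOOMError}} implementation; the {{OomType}} enum and {{classify}} method are invented names, and the marker strings are taken from the log above and the issue description. It assumes simple substring matching over the collected error text:

```java
// Hypothetical sketch only -- not the actual Hive SparkTask#isOOMError code.
public class OomErrorClassifier {

    public enum OomType { NONE, TASK_OOM, EXECUTOR_LOST }

    public static OomType classify(String error) {
        if (error == null) {
            return OomType.NONE;
        }
        // Explicit memory markers: YARN killed the container for exceeding
        // its limits, or a Java OOM (including GC overhead) surfaced in a task.
        if (error.contains("Container killed by YARN for exceeding memory limits")
                || error.contains("GC overhead limit exceeded")
                || error.contains("java.lang.OutOfMemoryError")) {
            return OomType.TASK_OOM;
        }
        // Executor died without an explicit OOM marker, e.g. the
        // ExecutorLostFailure / exit code 143 case shown in the log above.
        if (error.contains("ExecutorLostFailure")) {
            return OomType.EXECUTOR_LOST;
        }
        return OomType.NONE;
    }

    public static void main(String[] args) {
        System.out.println(classify("GC overhead limit exceeded"));
        System.out.println(classify("ExecutorLostFailure (executor 65 exited)"));
    }
}
```

A pure substring scan like this cannot by itself separate a driver OOM from a task OOM; that distinction would additionally need context about where the error originated (for example, whether it was raised in the RemoteDriver process rather than reported as a task failure).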