[ https://issues.apache.org/jira/browse/HIVE-18869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16390546#comment-16390546 ]

Sahil Takiar commented on HIVE-18869:
-------------------------------------

When testing this, the following is the error Spark throws when an executor 
dies due to an OOM:

{code}
ERROR : Job failed with org.apache.spark.SparkException: Job aborted due to 
stage failure: Task 452 in stage 18.0 failed 4 times, most recent failure: Lost 
task 452.3 in stage 18.0 (TID 1532, vc0540.halxg.cloudera.com, executor 65): 
ExecutorLostFailure (executor 65 exited caused by one of the running tasks) 
Reason: Container marked as failed: container_1520386678430_0109_01_000068 on 
host: vc0540.halxg.cloudera.com. Exit status: 143. Diagnostics: [2018-03-07 
14:36:02.029]Container killed on request. Exit code is 143
java.util.concurrent.ExecutionException: Exception thrown by job
        at 
org.apache.spark.JavaFutureActionWrapper.getImpl(FutureAction.scala:272)
        at org.apache.spark.JavaFutureActionWrapper.get(FutureAction.scala:277)
        at 
org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:364)
        at 
org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:325)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
Task 452 in stage 18.0 failed 4 times, most recent failure: Lost task 452.3 in 
stage 18.0 (TID 1532, vc0540.halxg.cloudera.com, executor 65): 
ExecutorLostFailure (executor 65 exited caused by one of the running tasks) 
Reason: Container marked as failed: container_1520386678430_0109_01_000068 on 
host: vc0540.halxg.cloudera.com. Exit status: 143. Diagnostics: [2018-03-07 
14:36:02.029]Container killed on request. Exit code is 143
[2018-03-07 14:36:02.029]Container exited with a non-zero exit code 143.
[2018-03-07 14:36:02.029]Killed by external signal
{code}

Might be good to differentiate {{ExecutorLost}} failures too.
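
For illustration, here is a rough sketch of what more granular classification 
could look like. Only {{SparkTask#isOOMError}} and the quoted error strings 
come from the issue; the class, enum, and method names below are hypothetical:

{code}
public class SparkErrorClassifier {

  /** Failure categories the Hive client could report back to the user. */
  public enum FailureType { DRIVER_OOM, TASK_OOM, EXECUTOR_LOST, OTHER }

  public static FailureType classify(String error) {
    if (error == null) {
      return FailureType.OTHER;
    }
    // Task-side OOM signals: the container blew past its YARN memory
    // limit, or a task JVM spent too long in GC.
    if (error.contains("Container killed by YARN for exceeding memory limits")
        || error.contains("GC overhead limit exceeded")) {
      return FailureType.TASK_OOM;
    }
    // An OutOfMemoryError with no task-side marker: treat as driver OOM.
    if (error.contains("java.lang.OutOfMemoryError")) {
      return FailureType.DRIVER_OOM;
    }
    // Executor died without a recognized OOM string, e.g. the
    // "Exit status: 143" trace above.
    if (error.contains("ExecutorLostFailure")) {
      return FailureType.EXECUTOR_LOST;
    }
    return FailureType.OTHER;
  }
}
{code}

The ordering matters: task-side strings are checked first so that a 
{{GC overhead limit exceeded}} inside a lost-task message is not misreported 
as a driver OOM.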

> Improve SparkTask OOM Error Parsing Logic
> -----------------------------------------
>
>                 Key: HIVE-18869
>                 URL: https://issues.apache.org/jira/browse/HIVE-18869
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>            Reporter: Sahil Takiar
>            Priority: Major
>
> The method {{SparkTask#isOOMError}} parses a stack trace to check whether 
> the failure is due to an OOM error. A few improvements could be made:
> * Differentiate between driver OOM and task OOM
> * The string {{Container killed by YARN for exceeding memory limits}} is 
> printed if a container exceeds its memory limits, but Spark tasks can OOM for 
> other reasons, such as {{GC overhead limit exceeded}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)