[
https://issues.apache.org/jira/browse/SPARK-35006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Huy updated SPARK-35006:
------------------------
Description:
*Issue Description*
I have a Spark application with 1 driver (on bare metal) + 1 executor (on
K8s); note that this setup is only for testing purposes. The corresponding
configuration can be found below.
When I run a task that loads and computes on an XML file, the executor hits an
OOM error because the XML file is large (which is intentional):
{code}
#Aborting due to java.lang.OutOfMemoryError: Java heap space
#
# A fatal error has been detected by the Java Runtime Environment:
#
# Internal Error (debug.cpp:308), pid=19, tid=0x00007fff765ae700
# fatal error: OutOfMemory encountered: Java heap space
{code}
However, the driver doesn't recognize this error as a task failure. Instead,
it treats it as a framework issue and keeps retrying the task:
INFO TaskSchedulerImpl:57 - Executor 1 on 10.87.88.44 killed by driver.
INFO TaskSetManager:57 - task 0.0 in stage 0.0 (TID 0) failed because while it
was being computed, its executor exited for a reason unrelated to the task.
*{color:#de350b}Not counting this failure towards the maximum number of
failures for the task{color}.*
INFO BlockManagerMasterEndpoint:57 - Trying to remove executor 1 from
BlockManagerMaster.
As a result, the Spark application keeps retrying the task forever and blocks
other tasks from running.
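The retry behavior described above can be sketched as a toy model (plain Python, not actual Spark code): Spark's scheduler only aborts a task set once a task's failure count reaches spark.task.maxFailures, and executor losses reported as "unrelated to the task" never increment that count, so the task is retried without bound.

```python
# Toy model of the driver's task-retry accounting (not real Spark code).
# The scheduler aborts a task set once a task has failed
# spark.task.maxFailures times; failures flagged as not counting towards
# task failures (as in the log above) never increment that count.

MAX_FAILURES = 4  # default value of spark.task.maxFailures

def attempts_until_abort(counts_towards_failures, attempt_limit=100):
    """Return the number of attempts before the task set is aborted,
    or None if the driver would still be retrying at attempt_limit."""
    failures = 0
    for attempt in range(1, attempt_limit + 1):
        # In this model every attempt fails with an executor loss.
        if counts_towards_failures:
            failures += 1
        if failures >= MAX_FAILURES:
            return attempt  # task set aborted; the job fails fast
    return None  # retried indefinitely (the behavior reported here)

print(attempts_until_abort(True))   # aborts after 4 attempts
print(attempts_until_abort(False))  # never aborts
```

When the OOM-killed executor is classified as a task failure, the job fails after 4 attempts instead of looping forever.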
*Expectation*
The Spark driver should classify an OOM on an executor pod as a task failure
and count it towards the maximum number of task failures
(spark.task.maxFailures).
*Configuration*
spark.kubernetes.container.image: "spark_image_path"
spark.kubernetes.container.image.pullPolicy: "Always"
spark.kubernetes.namespace: "qa-namespace"
spark.kubernetes.authenticate.driver.serviceAccountName: "svc-account"
spark.kubernetes.executor.request.cores: "2"
spark.kubernetes.executor.limit.cores: "2"
spark.executorEnv.SPARK_ENV: "dev"
spark.executor.memoryOverhead: "1G"
spark.executor.memory: "6g"
spark.executor.cores: "2"
spark.executor.instances: "3"
spark.driver.maxResultSize: "1g"
spark.driver.memory: "10g"
spark.driver.cores: "2"
spark.eventLog.enabled: 'true'
spark.driver.extraJavaOptions: "-Dcom.sun.management.jmxremote \
-Dcom.sun.management.jmxremote.authenticate=false \
-Dcom.sun.management.jmxremote.ssl=false \
-XX:+UseG1GC \
-XX:+PrintFlagsFinal \
-XX:+PrintReferenceGC -verbose:gc \
-XX:+PrintGCDetails \
-XX:+PrintGCTimeStamps \
-XX:+PrintAdaptiveSizePolicy \
-XX:+UnlockDiagnosticVMOptions \
-XX:+G1SummarizeConcMark \
-XX:InitiatingHeapOccupancyPercent=35 \
-XX:ConcGCThreads=20 \
-XX:+PrintGCCause \
-XX:+AlwaysPreTouch \
-Dlog4j.debug=true -Dlog4j.configuration=[file:///].... "
spark.sql.session.timeZone: UTC
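For reference, the executor pod's memory footprint under the configuration above works out as follows (a sketch; the formula ignores off-heap and PySpark memory). The JVM heap itself is capped at spark.executor.memory, which is where the java.lang.OutOfMemoryError above originates regardless of the pod-level limit.

```python
# Executor container memory on K8s is approximately
# spark.executor.memory + spark.executor.memoryOverhead
# (off-heap and PySpark memory omitted). Values copied from the
# configuration above.

def to_mib(size):
    """Parse a Spark memory string like '6g' or '512m' into MiB."""
    units = {"g": 1024, "m": 1}
    return int(size[:-1]) * units[size[-1].lower()]

executor_memory = to_mib("6g")  # JVM heap (-Xmx), where the OOM occurs
memory_overhead = to_mib("1G")  # non-heap headroom for the container

pod_memory_mib = executor_memory + memory_overhead
print(pod_memory_mib)  # 7168 MiB requested for the executor pod
```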
> Spark driver mistakenly classifies OOM error of executor (on K8s pod) as
> framework error
> ----------------------------------------------------------------------------------------
>
> Key: SPARK-35006
> URL: https://issues.apache.org/jira/browse/SPARK-35006
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes
> Affects Versions: 3.1.1
> Reporter: Huy
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]