[https://issues.apache.org/jira/browse/SPARK-27900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16855598#comment-16855598]
Stavros Kontopoulos commented on SPARK-27900:
---------------------------------------------
Setting Thread.setDefaultUncaughtExceptionHandler(new SparkUncaughtExceptionHandler) in SparkSubmit in client mode does cause the handler to be invoked on the driver side (a sketch of the change follows the log):
19/06/04 11:01:42 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[dag-scheduler-event-loop,5,main]
java.lang.OutOfMemoryError: Java heap space
at scala.collection.mutable.ResizableArray.ensureSize(ResizableArray.scala:106)
at scala.collection.mutable.ResizableArray.ensureSize$(ResizableArray.scala:96)
at scala.collection.mutable.ArrayBuffer.ensureSize(ArrayBuffer.scala:49)
at scala.collection.mutable.ArrayBuffer.$plus$eq(ArrayBuffer.scala:85)
at org.apache.spark.scheduler.TaskSetManager.addPendingTask(TaskSetManager.scala:264)
at org.apache.spark.scheduler.TaskSetManager.$anonfun$addPendingTasks$2(TaskSetManager.scala:194)
at org.apache.spark.scheduler.TaskSetManager$$Lambda$1109/206130956.apply$mcVI$sp(Unknown Source)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
at org.apache.spark.scheduler.TaskSetManager.$anonfun$addPendingTasks$1(TaskSetManager.scala:193)
at org.apache.spark.scheduler.TaskSetManager$$Lambda$1108/329172165.apply$mcV$sp(Unknown Source)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.util.Utils$.timeTakenMs(Utils.scala:534)
at org.apache.spark.scheduler.TaskSetManager.addPendingTasks(TaskSetManager.scala:192)
at org.apache.spark.scheduler.TaskSetManager.<init>(TaskSetManager.scala:189)
at org.apache.spark.scheduler.TaskSchedulerImpl.createTaskSetManager(TaskSchedulerImpl.scala:252)
at org.apache.spark.scheduler.TaskSchedulerImpl.submitTasks(TaskSchedulerImpl.scala:210)
at org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1233)
at org.apache.spark.scheduler.DAGScheduler.submitStage(DAGScheduler.scala:1084)
at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:1028)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2126)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2118)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2107)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
19/06/04 11:01:42 INFO SparkContext: Invoking stop() from shutdown hook
19/06/04 11:01:42 INFO SparkUI: Stopped Spark web UI at http://spark-pi2-1559645994185-driver-svc.spark.svc:4040
19/06/04 11:01:42 INFO BlockManagerInfo: Removed broadcast_0_piece0 on spark-pi2-1559645994185-driver-svc.spark.svc:7079 in memory (size: 1765.0 B, free: 110.0 MiB)
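For reference, what I tried is roughly the following (just a sketch; the exact spot in SparkSubmit is debatable, and SparkUncaughtExceptionHandler is private[spark]):
{code:scala}
// Sketch of the experiment: install Spark's uncaught exception handler as the
// JVM-wide default before the user's main class runs in client mode, so that an
// OOM on a daemon thread such as dag-scheduler-event-loop terminates the JVM
// instead of leaving the driver pod Running.
import org.apache.spark.util.SparkUncaughtExceptionHandler

Thread.setDefaultUncaughtExceptionHandler(new SparkUncaughtExceptionHandler)
{code}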
The main thread, though, is still stuck at:
"main" #1 prio=5 os_prio=0 tid=0x00005653d3a5e800 nid=0x1d waiting on condition
[0x00007f31b7ca7000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000000f0fd01a0> (a
scala.concurrent.impl.Promise$CompletionLatch)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:242)
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:258)
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:187)
at org.apache.spark.util.ThreadUtils$.awaitReady(ThreadUtils.scala:242)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:736)
i.e. at [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L736]:
ThreadUtils.awaitReady(waiter.completionFuture, Duration.Inf)
This wait should be configurable, imho.
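What I have in mind is something along these lines; the conf key below is hypothetical (it does not exist in Spark today) and only illustrates bounding the currently infinite wait in DAGScheduler.runJob:
{code:scala}
import scala.concurrent.duration._

// Hypothetical sketch: bound the driver-side wait on the job's completion
// future instead of hard-coding Duration.Inf. "spark.driver.jobWaitTimeout"
// (in seconds) is a made-up key used only to illustrate the idea.
val jobWaitTimeout: Duration = sc.conf
  .getOption("spark.driver.jobWaitTimeout")
  .map(_.toLong.seconds)
  .getOrElse(Duration.Inf)   // today's hard-coded behaviour

ThreadUtils.awaitReady(waiter.completionFuture, jobWaitTimeout)
{code}
That way the driver at least gives up after a bounded time instead of waiting forever on a scheduler loop that is already dead.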
> Spark on K8s will not report container failure due to an oom error
> ------------------------------------------------------------------
>
> Key: SPARK-27900
> URL: https://issues.apache.org/jira/browse/SPARK-27900
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes
> Affects Versions: 3.0.0, 2.4.3
> Reporter: Stavros Kontopoulos
> Priority: Major
>
> A spark pi job is running:
> spark-pi-driver 1/1 Running 0 1h
> spark-pi2-1559309337787-exec-1 1/1 Running 0 1h
> spark-pi2-1559309337787-exec-2 1/1 Running 0 1h
> with the following setup:
> {quote}apiVersion: "sparkoperator.k8s.io/v1beta1"
> kind: SparkApplication
> metadata:
>   name: spark-pi
>   namespace: spark
> spec:
>   type: Scala
>   mode: cluster
>   image: "skonto/spark:k8s-3.0.0-sa"
>   imagePullPolicy: Always
>   mainClass: org.apache.spark.examples.SparkPi
>   mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.12-3.0.0-SNAPSHOT.jar"
>   arguments:
>     - "1000000"
>   sparkVersion: "2.4.0"
>   restartPolicy:
>     type: Never
>   nodeSelector:
>     "spark": "autotune"
>   driver:
>     memory: "1g"
>     labels:
>       version: 2.4.0
>     serviceAccount: spark-sa
>   executor:
>     instances: 2
>     memory: "1g"
>     labels:
>       version: 2.4.0{quote}
> At some point the driver fails, but its JVM keeps running and so the pods stay Running:
> 19/05/31 13:29:20 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34), which has no missing parents
> 19/05/31 13:29:23 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 3.0 KiB, free 110.0 MiB)
> 19/05/31 13:29:23 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1765.0 B, free 110.0 MiB)
> 19/05/31 13:29:23 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on spark-pi2-1559309337787-driver-svc.spark.svc:7079 (size: 1765.0 B, free: 110.0 MiB)
> 19/05/31 13:29:23 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1180
> 19/05/31 13:29:25 INFO DAGScheduler: Submitting 1000000 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
> 19/05/31 13:29:25 INFO TaskSchedulerImpl: Adding task set 0.0 with 1000000 tasks
> Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError: Java heap space
> at scala.collection.mutable.ResizableArray.ensureSize(ResizableArray.scala:106)
> at scala.collection.mutable.ResizableArray.ensureSize$(ResizableArray.scala:96)
> at scala.collection.mutable.ArrayBuffer.ensureSize(ArrayBuffer.scala:49)
> Mem: 2295260K used, 24458144K free, 1636K shrd, 48052K buff, 899424K cached
> $ kubectl describe pod spark-pi2-driver -n spark
> Name: spark-pi2-driver
> Namespace: spark
> Priority: 0
> PriorityClassName: <none>
> Node: gke-test-cluster-1-spark-autotune-46c36f4f-x3z9/10.138.0.44
> Start Time: Fri, 31 May 2019 16:28:59 +0300
> Labels: spark-app-selector=spark-74d8e5a8f1af428d91093dfa6ee9d661
> spark-role=driver
> sparkoperator.k8s.io/app-name=spark-pi2
> sparkoperator.k8s.io/launched-by-spark-operator=true
> sparkoperator.k8s.io/submission-id=spark-pi2-1559309336226927526
> version=2.4.0
> Annotations: <none>
> Status: Running
> IP: 10.12.103.4
> Controlled By: SparkApplication/spark-pi2
> Containers:
> spark-kubernetes-driver:
> Container ID: docker://55dadb603290b42f9ddb71959edf0224ddc7ea621ee15429941d3bcc7db9b71f
> Image: skonto/spark:k8s-3.0.0-sa
> Image ID: docker-pullable://skonto/spark@sha256:6268d760d1a006b69c7086f946e4d5d9a3b99f149832c63cfc7fe39671f5cda9
> Ports: 7078/TCP, 7079/TCP, 4040/TCP
> Host Ports: 0/TCP, 0/TCP, 0/TCP
> Args:
> driver
> --properties-file
> /opt/spark/conf/spark.properties
> --class
> org.apache.spark.examples.SparkPi
> spark-internal
> 1000000
> State: Running
> In the container, the processes are in _interruptible sleep_:
> PID PPID USER STAT VSZ %VSZ CPU %CPU COMMAND
> 15 1 185 S 2114m 7% 0 0% /usr/lib/jvm/java-1.8-openjdk/bin/java -cp /opt/spark/conf/:/opt/spark/jars/* -Xmx500m org.apache.spark.deploy.SparkSubmit --deploy-mode client --conf spar
> 287 0 185 S 2344 0% 3 0% sh
> 294 287 185 R 1536 0% 3 0% top
> 1 0 185 S 776 0% 0 0% /sbin/tini -s -- /opt/spark/bin/spark-submit --conf spark.driver.bindAddress=10.12.103.4 --deploy-mode client --properties-file /opt/spark/conf/spark.prope
> Liveness checks might be a workaround, but the REST APIs may still be working if threads in the JVM are still running, as in this case (I did check the Spark UI and it was still up).
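> To make that concrete, a probe would have to look at thread state rather than at an HTTP port; a minimal sketch of such a check (hypothetical, Spark exposes nothing like this today):
> {code:scala}
> import scala.collection.JavaConverters._
>
> // Returns false once the DAG scheduler event loop has died, which is exactly
> // the state this driver ended up in while the UI kept serving requests.
> def schedulerLoopAlive(): Boolean =
>   Thread.getAllStackTraces.keySet.asScala.exists(_.getName == "dag-scheduler-event-loop")
> {code}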
>
>