Hi All, I have a Spark job that runs over a huge bulk of data with dynamic allocation enabled. The job takes some 15 minutes to start up and fails as soon as it starts*.
Is there anything I can check to debug this problem? There is not a lot of information in the logs about the exact cause, but here is a snapshot below. Thanks All.

* - by "starts" I mean when the job first shows something on the Spark web UI; before that it is just a blank page.

Logs here -
{code}
16/09/23 06:33:19 INFO ApplicationMaster: Started progress reporter thread with (heartbeat : 3000, initial allocation : 200) intervals
16/09/23 06:33:27 INFO YarnAllocator: Driver requested a total number of 168510 executor(s).
16/09/23 06:33:27 INFO YarnAllocator: Will request 168510 executor containers, each with 2 cores and 6758 MB memory including 614 MB overhead
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 22
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 19
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 18
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 12
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 11
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 20
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 15
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 7
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 8
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 16
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 21
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 6
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 13
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 14
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 9
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 3
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 17
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 1
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 10
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 4
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 2
16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 5
16/09/23 06:33:36 WARN ApplicationMaster: Reporter thread fails 1 time(s) in a row.
java.lang.StackOverflowError
	at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
	at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
	at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
	at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
	at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
	at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
	at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
	at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
	at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
	at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
	at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
{code}
...
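The StackOverflowError frames alternate between lazy `MappedValues` and `WithFilter` wrappers, which is the signature of deeply nested lazy collection transformations: each `mapValues`/`view.map` wraps the previous collection instead of materializing it, so a single traversal recurses once per layer. Below is a minimal sketch of that Scala-collections behavior (not Spark's actual code; the object and method names are illustrative):

```scala
// Sketch: nesting lazy transformations makes traversal depth proportional
// to the number of layers, so deep nesting overflows the JVM stack.
object NestedLazyViews {
  // Wrap a small collection in `depth` layers of lazy `view.map`.
  def nest(depth: Int): Iterable[Int] = {
    var xs: Iterable[Int] = List(1, 2, 3)
    var i = 0
    while (i < depth) { xs = xs.view.map(identity); i += 1 }
    xs
  }

  // True if traversing the nested structure blows the stack.
  def overflows(depth: Int): Boolean =
    try { nest(depth).foreach(_ => ()); false }
    catch { case _: StackOverflowError => true }

  def main(args: Array[String]): Unit = {
    println(overflows(100))     // shallow nesting traverses fine
    println(overflows(1000000)) // deep nesting throws StackOverflowError
  }
}
```

If something similar is happening inside the allocator's bookkeeping while it churns through thousands of executor updates, that would explain why the reporter thread dies with a StackOverflowError rather than an out-of-memory or RPC error.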
<trimmed logs>
{code}
16/09/23 06:33:36 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to get executor loss reason for executor id 7 at RPC address , but got no response. Marking as slave lost.
org.apache.spark.SparkException: Fail to find loss reason for non-existent executor 7
	at org.apache.spark.deploy.yarn.YarnAllocator.enqueueGetLossReasonRequest(YarnAllocator.scala:554)
	at org.apache.spark.deploy.yarn.ApplicationMaster$AMEndpoint$$anonfun$receiveAndReply$1.applyOrElse(ApplicationMaster.scala:632)
	at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:104)
	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
	at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
{code}
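One thing worth noting from the logs: the driver requested 168510 executors at once, which suggests `spark.dynamicAllocation.maxExecutors` is unset (it defaults to unlimited). Capping it may at least reduce the allocator churn while debugging. A hedged sketch of the submit configuration, with purely illustrative values that would need tuning for the actual cluster:

```shell
# Cap dynamic allocation so the AM cannot request an unbounded number of
# executors at startup; 500/50 are illustrative, not recommendations.
spark-submit \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.maxExecutors=500 \
  --conf spark.dynamicAllocation.initialExecutors=50 \
  ...
```

This does not fix the StackOverflowError itself, but a bounded executor count makes the allocator's behavior easier to observe and rules out the unbounded request as a contributing factor.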