Have been playing around with configs to crack this. Adding them here in case it helps others :) The number of executors and the timeouts seemed to be the core issue.
{code}
--driver-memory 4G \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.maxExecutors=500 \
--conf spark.core.connection.ack.wait.timeout=6000 \
--conf spark.akka.heartbeat.interval=6000 \
--conf spark.akka.frameSize=100 \
--conf spark.akka.timeout=6000 \
{code}

Cheers!

On Fri, Sep 23, 2016 at 7:50 PM, <aditya.calangut...@augmentiq.co.in> wrote:

> For testing purposes, can you run with a fixed number of executors and try?
> Maybe 12 executors for testing, and let us know the status.
>
> Get Outlook for Android <https://aka.ms/ghei36>
>
> On Fri, Sep 23, 2016 at 3:13 PM +0530, "Yash Sharma" <yash...@gmail.com>
> wrote:
>
>> Thanks Aditya, appreciate the help.
>>
>> I had the exact same thought about the huge number of executors requested.
>> I am going with dynamic executors and not specifying the number of
>> executors. Are you suggesting that I should limit the number of executors
>> when the dynamic allocator requests more of them?
>>
>> It's a 12-node EMR cluster and has more than a TB of memory.
>>
>> On Fri, Sep 23, 2016 at 5:12 PM, Aditya <aditya.calangutkar@augmentiq.
>> co.in> wrote:
>>
>>> Hi Yash,
>>>
>>> What is your total cluster memory and number of cores?
>>> The problem might be with the number of executors you are allocating. The
>>> logs show it as 168510, which is very high. Try reducing your
>>> executors.
>>>
>>> On Friday 23 September 2016 12:34 PM, Yash Sharma wrote:
>>>
>>>> Hi All,
>>>> I have a Spark job which runs over a huge bulk of data with dynamic
>>>> allocation enabled.
>>>> The job takes some 15 minutes to start up and fails as soon as it
>>>> starts*.
>>>>
>>>> Is there anything I can check to debug this problem? There is not a lot
>>>> of information in the logs on the exact cause, but here is a snapshot below.
>>>>
>>>> Thanks All.
>>>>
>>>> * - by "starts" I mean when it shows something on the Spark web UI;
>>>> before that it's just a blank page.
>>>>
>>>> Logs here -
>>>>
>>>> {code}
>>>> 16/09/23 06:33:19 INFO ApplicationMaster: Started progress reporter thread with (heartbeat : 3000, initial allocation : 200) intervals
>>>> 16/09/23 06:33:27 INFO YarnAllocator: Driver requested a total number of 168510 executor(s).
>>>> 16/09/23 06:33:27 INFO YarnAllocator: Will request 168510 executor containers, each with 2 cores and 6758 MB memory including 614 MB overhead
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 22
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 19
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 18
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 12
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 11
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 20
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 15
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 7
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 8
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 16
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 21
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 6
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 13
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 14
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 9
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 3
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 17
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 1
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 10
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 4
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 2
>>>> 16/09/23 06:33:36 WARN YarnAllocator: Tried to get the loss reason for non-existent executor 5
>>>> 16/09/23 06:33:36 WARN ApplicationMaster: Reporter thread fails 1 time(s) in a row.
>>>> java.lang.StackOverflowError
>>>>         at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>         at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>         at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>>>>         at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>         at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>         at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>>>>         at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>         at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>         at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>>>>         at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>>         at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>>>> {code}
>>>>
>>>> ... <trimmed logs>
>>>>
>>>> {code}
>>>> 16/09/23 06:33:36 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to get executor loss reason for executor id 7 at RPC address , but got no response. Marking as slave lost.
>>>> org.apache.spark.SparkException: Fail to find loss reason for non-existent executor 7
>>>>         at org.apache.spark.deploy.yarn.YarnAllocator.enqueueGetLossReasonRequest(YarnAllocator.scala:554)
>>>>         at org.apache.spark.deploy.yarn.ApplicationMaster$AMEndpoint$$anonfun$receiveAndReply$1.applyOrElse(ApplicationMaster.scala:632)
>>>>         at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:104)
>>>>         at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
>>>>         at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
>>>>         at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
>>>>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>         at java.lang.Thread.run(Thread.java:745)
>>>> {code}