Hi Deb,

The current state of the art is to increase spark.yarn.executor.memoryOverhead until the job stops failing. We do have plans to try to automatically scale this based on the amount of memory requested, but it will still just be a heuristic.
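For example, something along these lines on the SparkConf (the memory values below are placeholders, not recommendations; the overhead is in MB):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.executor.memory", "8g")                 // heap requested per executor
      .set("spark.yarn.executor.memoryOverhead", "1024")  // extra MB of headroom for off-heap / JVM overhead

The same property can be passed with --conf on spark-submit; keep bumping the overhead until YARN stops killing the containers.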
-Sandy

On Tue, Sep 9, 2014 at 7:32 AM, Debasish Das <[email protected]> wrote:

> Hi Sandy,
>
> Any resolution for the YARN failures? It's a blocker for running spark on top
> of YARN.
>
> Thanks.
> Deb
>
> On Tue, Aug 19, 2014 at 11:29 PM, Xiangrui Meng <[email protected]> wrote:
>
>> Hi Deb,
>>
>> I think this may be the same issue as described in
>> https://issues.apache.org/jira/browse/SPARK-2121 . We know that the
>> container got killed by YARN because it used much more memory than it
>> requested. But we haven't figured out the root cause yet.
>>
>> +Sandy
>>
>> Best,
>> Xiangrui
>>
>> On Tue, Aug 19, 2014 at 8:51 PM, Debasish Das <[email protected]>
>> wrote:
>> > Hi,
>> >
>> > During the 4th ALS iteration, I am noticing that one of the executors
>> > gets disconnected:
>> >
>> > 14/08/19 23:40:00 ERROR network.ConnectionManager: Corresponding
>> > SendingConnectionManagerId not found
>> >
>> > 14/08/19 23:40:00 INFO cluster.YarnClientSchedulerBackend: Executor 5
>> > disconnected, so removing it
>> >
>> > 14/08/19 23:40:00 ERROR cluster.YarnClientClusterScheduler: Lost executor 5
>> > on tblpmidn42adv-hdp.tdc.vzwcorp.com: remote Akka client disassociated
>> >
>> > 14/08/19 23:40:00 INFO scheduler.DAGScheduler: Executor lost: 5 (epoch 12)
>> >
>> > Any idea if this is a bug related to akka on YARN?
>> >
>> > I am using master.
>> >
>> > Thanks.
>> > Deb
>> >
