Can someone from Databricks test and commit this PR? This is not a complete solution, but would provide some relief. https://github.com/apache/spark/pull/1391
Thanks,
Nishkam

On Wed, Aug 20, 2014 at 12:39 AM, Sandy Ryza <sandy.r...@cloudera.com> wrote:

> Hi Debasish,
>
> The fix is to raise spark.yarn.executor.memoryOverhead until this goes
> away. This controls the buffer between the JVM heap size and the amount of
> memory requested from YARN (JVMs can take up memory beyond their heap
> size). You should also make sure that, in the YARN NodeManager
> configuration, yarn.nodemanager.vmem-check-enabled is set to false.
>
> -Sandy
>
>
> On Wed, Aug 20, 2014 at 12:27 AM, Debasish Das <debasish.da...@gmail.com>
> wrote:
>
> > I could reproduce the issue in both 1.0 and 1.1 using YARN...so this is
> > definitely a YARN related problem...
> >
> > At least for me right now the only deployment option possible is
> > standalone...
> >
> >
> > On Tue, Aug 19, 2014 at 11:29 PM, Xiangrui Meng <men...@gmail.com>
> > wrote:
> >
> >> Hi Deb,
> >>
> >> I think this may be the same issue as described in
> >> https://issues.apache.org/jira/browse/SPARK-2121 . We know that the
> >> container got killed by YARN because it used much more memory than it
> >> requested. But we haven't figured out the root cause yet.
> >>
> >> +Sandy
> >>
> >> Best,
> >> Xiangrui
> >>
> >> On Tue, Aug 19, 2014 at 8:51 PM, Debasish Das <debasish.da...@gmail.com>
> >> wrote:
> >> > Hi,
> >> >
> >> > During the 4th ALS iteration, I am noticing that one of the executors
> >> > gets disconnected:
> >> >
> >> > 14/08/19 23:40:00 ERROR network.ConnectionManager: Corresponding SendingConnectionManagerId not found
> >> >
> >> > 14/08/19 23:40:00 INFO cluster.YarnClientSchedulerBackend: Executor 5 disconnected, so removing it
> >> >
> >> > 14/08/19 23:40:00 ERROR cluster.YarnClientClusterScheduler: Lost executor 5 on tblpmidn42adv-hdp.tdc.vzwcorp.com: remote Akka client disassociated
> >> >
> >> > 14/08/19 23:40:00 INFO scheduler.DAGScheduler: Executor lost: 5 (epoch 12)
> >> >
> >> > Any idea if this is a bug related to Akka on YARN?
> >> >
> >> > I am using master
> >> >
> >> > Thanks.
> >> > Deb
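
A rough sketch of the arithmetic behind Sandy's suggestion, in Python. It assumes the container size requested from YARN is simply the executor heap plus spark.yarn.executor.memoryOverhead, and it assumes a 384 MB default for that setting; check the documentation for your Spark version. It also ignores any rounding YARN applies to allocation sizes. The function names and the example figures are made up for illustration and do not come from the logs in this thread.

```python
# Assumption: default value of spark.yarn.executor.memoryOverhead, in MB.
# Verify against the docs for the Spark version you run.
ASSUMED_DEFAULT_OVERHEAD_MB = 384


def container_request_mb(executor_memory_mb, memory_overhead_mb=ASSUMED_DEFAULT_OVERHEAD_MB):
    """Memory requested from YARN per executor: JVM heap plus the overhead buffer."""
    if executor_memory_mb <= 0 or memory_overhead_mb < 0:
        raise ValueError("memory sizes must be positive")
    return executor_memory_mb + memory_overhead_mb


def exceeds_limit(observed_process_mb, executor_memory_mb,
                  memory_overhead_mb=ASSUMED_DEFAULT_OVERHEAD_MB):
    """True if the executor process has outgrown its container request,
    which is the condition under which YARN kills the container."""
    return observed_process_mb > container_request_mb(executor_memory_mb, memory_overhead_mb)


if __name__ == "__main__":
    heap_mb = 4096       # hypothetical --executor-memory 4g
    observed_mb = 4700   # hypothetical resident size of the executor JVM
    for overhead_mb in (ASSUMED_DEFAULT_OVERHEAD_MB, 1024):
        print(overhead_mb,
              container_request_mb(heap_mb, overhead_mb),
              exceeds_limit(observed_mb, heap_mb, overhead_mb))
```

With these made-up numbers, the process overshoots the request at the assumed default overhead and fits once the overhead is raised to 1024 MB, which is the effect of raising the setting "until this goes away".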