One thing that confused me a lot while debugging is the error message. The important one,

  WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkDriver@xxx:50278] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].

is only logged at log level WARN.

Jianshi

On Tue, Oct 14, 2014 at 4:36 AM, Jianshi Huang <jianshi.hu...@gmail.com> wrote:

> Turned out it was caused by this issue:
> https://issues.apache.org/jira/browse/SPARK-3923
>
> Setting spark.akka.heartbeat.interval to 100 solved it.
>
> Jianshi
>
> On Mon, Oct 13, 2014 at 4:24 PM, Jianshi Huang <jianshi.hu...@gmail.com>
> wrote:
>
>> Hmm... it failed again, it just lasted a little bit longer.
>>
>> Jianshi
>>
>> On Mon, Oct 13, 2014 at 4:15 PM, Jianshi Huang <jianshi.hu...@gmail.com>
>> wrote:
>>
>>> https://issues.apache.org/jira/browse/SPARK-3106
>>>
>>> I'm seeing the same errors described in SPARK-3106 (no other types of
>>> errors confirmed), running a bunch of SQL queries on Spark 1.2.0 built
>>> from the latest master HEAD.
>>>
>>> Any updates to this issue?
>>>
>>> My main task is to join a huge fact table with a dozen dim tables (using
>>> HiveContext) and then map it to my class object. It failed a couple of
>>> times, and now I have cached the intermediate table and it currently
>>> seems to be working fine... no idea why until I found SPARK-3106
>>>
>>> Cheers,
>>> --
>>> Jianshi Huang
>>>
>>> LinkedIn: jianshi
>>> Twitter: @jshuang
>>> Github & Blog: http://huangjs.github.com/