In my tests, I found that this phenomenon may be caused by the RDD's long dependency chain. The chain is serialized into the task and sent to each executor, and deserializing the task then causes a stack overflow.
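To illustrate the mechanism outside of Spark, here is a minimal sketch in plain Scala. It assumes only that Java serialization walks an object graph recursively. Node and LineageDepthDemo are hypothetical names made up for this example, not Spark classes, and the sketch exercises the write side only (reading a deep graph back recurses in a similar way). The depth at which the overflow appears depends on the JVM stack size, so the numbers below are just illustrative.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Toy stand-in for an RDD (hypothetical, not a Spark class): each node keeps
// a reference to its parent, much as an RDD references the RDDs it depends on.
final class Node(val parent: Node) extends Serializable

object LineageDepthDemo {
  // Build a chain of `depth` nodes iteratively, like rdd = rdd.map(x => x) in a loop.
  def buildChain(depth: Int): Node = {
    var node: Node = null
    var i = 0
    while (i < depth) {
      node = new Node(node)
      i += 1
    }
    node
  }

  // Java serialization follows the parent references recursively, so a deep
  // enough chain exhausts the thread stack. Returns true if writing succeeded.
  def serializes(root: Node): Boolean =
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      out.writeObject(root)
      out.close()
      true
    } catch {
      case _: StackOverflowError => false
    }

  def main(args: Array[String]): Unit = {
    println(s"short chain serializes: ${serializes(buildChain(100))}")
    println(s"very long chain serializes: ${serializes(buildChain(1000000))}")
  }
}
```

With default JVM settings I would expect the short chain to serialize and the very long one to fail, but the exact tipping point will vary with -Xss and with how much each level of the real object graph costs.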
This happens especially in iterative jobs, such as:

    var rdd = ..
    for (i <- 0 to 100)
      rdd = rdd.map(x=>x)
    rdd = rdd.cache

Here the dependencies of rdd keep getting chained, and at some point a stack overflow occurs.

You can check
(https://groups.google.com/forum/?fromgroups#!searchin/spark-users/dependency/spark-users/-Cyfe3G6VwY/PFFnslzWn6AJ)
and
(https://groups.google.com/forum/?fromgroups#!searchin/spark-users/dependency/spark-users/NkxcmmS-DbM/c9qvuShbHEUJ)
for details.

The current workaround is to cut the dependency chain by checkpointing the RDD. A better way might be to clean up the dependency chain once the stage that materializes the RDD has executed.

Thanks
Jerry

-----Original Message-----
From: Reynold Xin [mailto:r...@databricks.com]
Sent: Sunday, January 26, 2014 2:04 PM
To: dev@spark.incubator.apache.org
Subject: Re: Any suggestion about JIRA 1006 "MLlib ALS gets stack overflow with too many iterations"?

I'm not entirely sure, but two candidates are

the visit function in stageDependsOn

submitStage

On Sat, Jan 25, 2014 at 10:01 PM, Aaron Davidson <ilike...@gmail.com> wrote:

> I'm an idiot, but which part of the DAGScheduler is recursive here?
> Seems like processEvent shouldn't have inherently recursive properties.
>
>
> On Sat, Jan 25, 2014 at 9:57 PM, Reynold Xin <r...@databricks.com> wrote:
>
> > It seems to me fixing DAGScheduler to make it not recursive is the
> > better solution here, given the cost of checkpointing.
> >
> > On Sat, Jan 25, 2014 at 9:49 PM, Xia, Junluan <junluan....@intel.com> wrote:
> >
> > > Hi all
> > >
> > > The description about this Bug submitted by Matei is as following
> > >
> > >
> > > The tipping point seems to be around 50. We should fix this by
> > > checkpointing the RDDs every 10-20 iterations to break the lineage chain,
> > > but checkpointing currently requires HDFS installed, which not all users
> > > will have.
> > >
> > > We might also be able to fix DAGScheduler to not be recursive.
> > >
> > >
> > > regards,
> > > Andrew