In my tests, this problem seems to be caused by the RDD's long dependency
chain. The chain is serialized into each task and sent to every executor,
and deserializing the task there can cause a stack overflow.

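This is especially likely in iterative jobs, such as:

    var rdd = ..

    for (i <- 0 to 100)
      rdd = rdd.map(x => x)  // each iteration adds one more link to the chain

    rdd = rdd.cache()

Here each map adds another dependency to the rdd's chain (cache does not
shorten it), so at some point a stack overflow will occur.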
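
To illustrate why depth alone is enough to overflow the stack, here is a
minimal sketch that uses plain Java serialization and no Spark at all. Node
and LineageDepthDemo are hypothetical names, not Spark classes; Node only
stands in for an RDD holding a reference to its parent. Java serialization
walks the object graph recursively on both write and read, and the sketch
shows only the write side for brevity. The depth at which it fails depends on
the JVM stack size.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Hypothetical stand-in for an RDD and its parent dependency:
// each node keeps a reference to the node it was derived from.
final class Node(val parent: Node) extends Serializable

object LineageDepthDemo {
  // Build a chain of `depth` nodes iteratively, like calling map in a loop.
  def chain(depth: Int): Node = {
    var node: Node = null
    var i = 0
    while (i < depth) { node = new Node(node); i += 1 }
    node
  }

  // Returns true if Java serialization of the chain completes,
  // false if the recursive traversal overflows the stack.
  def serializes(root: Node): Boolean = {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    try { out.writeObject(root); true }
    catch { case _: StackOverflowError => false }
  }

  def main(args: Array[String]): Unit = {
    println(s"depth 10: ${serializes(chain(10))}")
    println(s"depth 1000000: ${serializes(chain(1000000))}")
  }
}
```

With default stack sizes I would expect the short chain to serialize and the
very long one to fail, which is the same shape of failure as with a long RDD
lineage.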

You can check
(https://groups.google.com/forum/?fromgroups#!searchin/spark-users/dependency/spark-users/-Cyfe3G6VwY/PFFnslzWn6AJ)
and
(https://groups.google.com/forum/?fromgroups#!searchin/spark-users/dependency/spark-users/NkxcmmS-DbM/c9qvuShbHEUJ)
for details. The current workaround is to cut the dependency chain by
checkpointing the RDD. A better way might be to clear the dependency chain
after the stage that materializes the RDD has executed.
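
A rough sketch of the checkpointing workaround, assuming a local
SparkContext. CheckpointSketch, the interval of 10, and the checkpoint
directory are illustrative choices, not anything taken from MLlib. On a
cluster the directory would need to be on a shared filesystem such as HDFS.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object CheckpointSketch {
  // Apply `iterations` identity maps, truncating the lineage every `interval` steps.
  def iterate(start: RDD[Int], iterations: Int, interval: Int): RDD[Int] = {
    var rdd = start
    for (i <- 1 to iterations) {
      rdd = rdd.map(x => x)
      if (i % interval == 0) {
        rdd.cache()       // avoid recomputing the RDD when the checkpoint is written
        rdd.checkpoint()  // must be called before the first action on this RDD
        rdd.count()       // an action forces the checkpoint to materialize
      }
    }
    rdd
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("checkpoint-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    sc.setCheckpointDir("/tmp/spark-checkpoints") // hypothetical path
    val result = iterate(sc.parallelize(1 to 1000), 100, 10)
    println(result.toDebugString) // lineage should now start at the last checkpoint
    sc.stop()
  }
}
```

The trade-off is the one discussed in this thread: every checkpoint writes the
RDD out to storage, so a shorter interval keeps the chain shorter but costs
more I/O.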

Thanks
Jerry

-----Original Message-----
From: Reynold Xin [mailto:r...@databricks.com] 
Sent: Sunday, January 26, 2014 2:04 PM
To: dev@spark.incubator.apache.org
Subject: Re: Any suggestion about JIRA 1006 "MLlib ALS gets stack overflow with 
too many iterations"?

I'm not entirely sure, but two candidates are

the visit function in stageDependsOn

submitStage


On Sat, Jan 25, 2014 at 10:01 PM, Aaron Davidson <ilike...@gmail.com> wrote:

> I'm an idiot, but which part of the DAGScheduler is recursive here? 
> Seems like processEvent shouldn't have inherently recursive properties.
>
>
> On Sat, Jan 25, 2014 at 9:57 PM, Reynold Xin <r...@databricks.com> wrote:
>
> > It seems to me fixing DAGScheduler to make it not recursive is the 
> > better solution here, given the cost of checkpointing.
> >
> > On Sat, Jan 25, 2014 at 9:49 PM, Xia, Junluan 
> > <junluan....@intel.com>
> > wrote:
> >
> > > Hi all
> > >
> > > The description of this bug, as submitted by Matei, is as follows:
> > >
> > >
> > > The tipping point seems to be around 50. We should fix this by
> > > checkpointing the RDDs every 10-20 iterations to break the lineage
> > > chain, but checkpointing currently requires HDFS installed, which not
> > > all users will have.
> > >
> > > We might also be able to fix DAGScheduler to not be recursive.
> > >
> > >
> > > regards,
> > > Andrew
> > >
> > >
> >
>
