I looked into this after I opened that JIRA, and it's actually a bit harder to fix. Changing these visit() calls to use an explicit stack instead of recursion avoids the StackOverflowError there, but you still get a StackOverflowError when you send the task to a worker node, because Java serialization itself uses recursion. With the current codebase, therefore, the only real fix is to increase your JVM stack size. Longer-term, I'd like us to automatically call checkpoint() to break lineage graphs when they exceed a certain size, which would avoid the problem in both DAGScheduler and Java serialization. We could also add this manually to ALS now, without having a solution for other programs; that would be a great change to make to fix this JIRA.
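To see why serialization still overflows, here's a toy sketch in plain Scala (Node and DeepChainDemo are made-up names, nothing Spark-specific): default Java serialization writes an object graph recursively, one stack frame per link, so a long chain like an RDD lineage blows the stack no matter how DAGScheduler traverses it.

import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Each Node points at the previous one, like an RDD pointing at its parent.
class Node(val prev: Node) extends Serializable

object DeepChainDemo {
  def main(args: Array[String]): Unit = {
    var chain: Node = null
    for (_ <- 1 to 1000000) chain = new Node(chain)  // build a long "lineage"
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    out.writeObject(chain)  // recurses once per node -> StackOverflowError
  }
}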

Matei

On Jan 25, 2014, at 11:06 PM, Ewen Cheslack-Postava <m...@ewencp.org> wrote:

The three obvious ones in DAGScheduler.scala are in:

getParentStages
getMissingParentStages
stageDependsOn

They all follow the same pattern though (a local def visit(), followed by visit(root)), so it should be easy to replace the recursion with an explicit Scala stack in place of the call stack.
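For example, here's a rough sketch of that rewrite (RDD.dependencies and Dependency.rdd are the real Spark API; visitAll and handle are made-up names, and the real methods also do more, such as treating shuffle dependencies specially):

import scala.collection.mutable

// Iterative replacement for a recursive def visit(rdd): track visited RDDs
// and push pending ones onto an explicit stack, so graph depth no longer
// consumes JVM stack frames.
def visitAll(root: RDD[_])(handle: RDD[_] => Unit): Unit = {
  val visited = new mutable.HashSet[RDD[_]]
  val stack = mutable.Stack[RDD[_]](root)
  while (stack.nonEmpty) {
    val rdd = stack.pop()
    if (!visited(rdd)) {
      visited += rdd
      handle(rdd)
      rdd.dependencies.foreach(dep => stack.push(dep.rdd))
    }
  }
}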

January 25, 2014 at 10:52 PM
In my test I found that this phenomenon may be caused by the RDD's long dependency chain: the chain is serialized into the task and sent to each executor, and deserializing the task causes the stack overflow.

This shows up especially in iterative jobs, like:
var rdd = ..  // some initial RDD

for (i <- 0 to 100)
  rdd = rdd.map(x => x)

rdd = rdd.cache

Here the rdd's dependencies will be chained a hundred deep, and at some point a stack overflow will occur.

You can check (https://groups.google.com/forum/?fromgroups#!searchin/spark-users/dependency/spark-users/-Cyfe3G6VwY/PFFnslzWn6AJ) and (https://groups.google.com/forum/?fromgroups#!searchin/spark-users/dependency/spark-users/NkxcmmS-DbM/c9qvuShbHEUJ) for details. The current workaround is to cut the dependency chain by checkpointing the RDD, as sketched below; a better way might be to clean up the dependency chain after the materialized stage has executed.
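For example, a minimal sketch of this workaround applied to the loop above (the checkpoint directory path is made up, and sc is an existing SparkContext):

sc.setCheckpointDir("hdfs:///tmp/checkpoints")  // made-up path; any HDFS dir

var rdd = sc.parallelize(1 to 1000)
for (i <- 1 to 100) {
  rdd = rdd.map(x => x)
  if (i % 10 == 0) {
    rdd.cache()       // cache first so checkpointing doesn't recompute the chain
    rdd.checkpoint()  // marks the RDD; its lineage is cut once materialized
  }
}
rdd.count()  // the action triggers the pending checkpoints and cuts the chain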

Thanks
Jerry

-----Original Message-----
From: Reynold Xin [mailto:r...@databricks.com]
Sent: Sunday, January 26, 2014 2:04 PM
To: dev@spark.incubator.apache.org
Subject: Re: Any suggestion about JIRA 1006 "MLlib ALS gets stack overflow with too many iterations"?

I'm not entirely sure, but two candidates are

the visit function in stageDependsOn

submitStage

January 25, 2014 at 10:01 PM
I'm an idiot, but which part of the DAGScheduler is recursive here? Seems
like processEvent shouldn't have inherently recursive properties.



January 25, 2014 at 9:57 PM
It seems to me that fixing DAGScheduler so it isn't recursive is the better solution here, given the cost of checkpointing.


January 25, 2014 at 9:49 PM
Hi all

The description of this bug, as submitted by Matei, is as follows:


The tipping point seems to be around 50 iterations. We should fix this by checkpointing the RDDs every 10-20 iterations to break the lineage chain, but checkpointing currently requires HDFS to be installed, which not all users will have.

We might also be able to fix DAGScheduler to not be recursive.


regards,
Andrew


