Agree that it should be fixed if possible. But why run ALS for 50 iterations? It tends to pretty much converge (to within 0.001 or so RMSE) after 5-10 iterations, and even 20 is probably overkill.

— Sent from Mailbox for iPhone
On Sun, Jan 26, 2014 at 9:59 AM, Matei Zaharia <matei.zaha...@gmail.com> wrote:

I looked into this after I opened that JIRA and it’s actually a bit harder to fix. While changing these visit() calls to use a stack manually instead of being recursive helps avoid a StackOverflowError there, you still get a StackOverflowError when you send the task to a worker node, because Java serialization uses recursion. The only real fix with the current codebase is therefore to increase your JVM stack size. Longer-term, I’d like us to automatically call checkpoint() to break lineage graphs when they exceed a certain size, which would avoid the problems in both DAGScheduler and Java serialization. We could also manually add this to ALS now without having a solution for other programs. That would be a great change to make to fix this JIRA.

Matei

On Jan 25, 2014, at 11:06 PM, Ewen Cheslack-Postava <m...@ewencp.org> wrote:

The three obvious ones in DAGScheduler.scala are in:

getParentStages
getMissingParentStages
stageDependsOn

They all follow the same pattern though (def visit(), followed by visit(root)), so they should be easy to replace with a Scala stack in place of the call stack.
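Below is a minimal sketch of the transformation Ewen describes, written against a made-up Node type rather than the real DAGScheduler code: the same traversal once as a recursive nested visit() and once with an explicit Scala stack in place of the call stack. As Matei notes above, this only removes the recursion on the scheduler side; the recursion inside Java serialization is a separate problem.

import scala.collection.mutable

object VisitSketch {
  // Toy stand-in for a node in a dependency graph; the real DAGScheduler
  // walks RDD/stage dependencies instead.
  case class Node(id: Int, deps: Seq[Node])

  // Current shape: a nested def visit() called recursively from visit(root).
  // Deep dependency chains overflow the JVM call stack.
  def collectRecursive(root: Node): Set[Int] = {
    val visited = mutable.HashSet[Int]()
    def visit(n: Node): Unit = {
      if (!visited(n.id)) {
        visited += n.id
        n.deps.foreach(visit)
      }
    }
    visit(root)
    visited.toSet
  }

  // Same traversal with an explicit Scala stack in place of the call stack,
  // so depth is limited by heap size rather than by JVM stack size.
  def collectIterative(root: Node): Set[Int] = {
    val visited = mutable.HashSet[Int]()
    val stack = mutable.Stack(root)
    while (stack.nonEmpty) {
      val n = stack.pop()
      if (!visited(n.id)) {
        visited += n.id
        n.deps.foreach(stack.push)
      }
    }
    visited.toSet
  }

  def main(args: Array[String]): Unit = {
    // Build a long chain like the one a heavily iterative job produces.
    val chain = (1 to 100000).foldLeft(Node(0, Nil))((prev, i) => Node(i, Seq(prev)))
    // collectRecursive(chain) would likely throw StackOverflowError here;
    // the iterative version handles it fine.
    println(collectIterative(chain).size)
  }
}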
Shao, Saisai January 25, 2014 at 10:52 PM

In my tests I found this phenomenon can be caused by an RDD's long dependency chain: the dependency chain is serialized into the task and sent to each executor, and deserializing the task causes the stack overflow.

This happens especially in iterative jobs, like:

var rdd = ..

for (i <- 0 to 100)
  rdd = rdd.map(x => x)

rdd = rdd.cache

Here rdd's dependencies are chained, and at some point a stack overflow will occur.

You can check
https://groups.google.com/forum/?fromgroups#!searchin/spark-users/dependency/spark-users/-Cyfe3G6VwY/PFFnslzWn6AJ
and
https://groups.google.com/forum/?fromgroups#!searchin/spark-users/dependency/spark-users/NkxcmmS-DbM/c9qvuShbHEUJ
for details. The current workaround is to cut the dependency chain by checkpointing the RDD; maybe a better way would be to clean up the dependency chain after the materialized stage has executed.

Thanks,
Jerry

-----Original Message-----
From: Reynold Xin [mailto:r...@databricks.com]
Sent: Sunday, January 26, 2014 2:04 PM
To: dev@spark.incubator.apache.org
Subject: Re: Any suggestion about JIRA 1006 "MLlib ALS gets stack overflow with too many iterations"?

I'm not entirely sure, but two candidates are:

the visit function in stageDependsOn
submitStage

Aaron Davidson January 25, 2014 at 10:01 PM

I'm an idiot, but which part of the DAGScheduler is recursive here? It seems like processEvent shouldn't have inherently recursive properties.

Reynold Xin January 25, 2014 at 9:57 PM

It seems to me that fixing DAGScheduler to make it not recursive is the better solution here, given the cost of checkpointing.

Xia, Junluan January 25, 2014 at 9:49 PM

Hi all,

The description of this bug submitted by Matei is as follows:

"The tipping point seems to be around 50. We should fix this by checkpointing the RDDs every 10-20 iterations to break the lineage chain, but checkpointing currently requires HDFS installed, which not all users will have. We might also be able to fix DAGScheduler to not be recursive."

regards,
Andrew
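To make the checkpointing workaround concrete, here is a minimal sketch of checkpointing every N iterations to break the lineage chain, applied to Jerry's toy map loop rather than to ALS itself. The master URL, checkpoint directory, and interval are placeholder assumptions; as the bug description notes, checkpointing on a real cluster generally requires HDFS.

import org.apache.spark.SparkContext

object LineageCheckpointSketch {
  def main(args: Array[String]): Unit = {
    // Local master and checkpoint path are placeholders for this sketch.
    val sc = new SparkContext("local[2]", "lineage-checkpoint-sketch")
    sc.setCheckpointDir("/tmp/spark-checkpoints") // use an HDFS path on a real cluster

    val checkpointInterval = 10 // every 10-20 iterations, per the JIRA description
    var rdd = sc.parallelize(1 to 1000000)

    for (i <- 1 to 100) {
      rdd = rdd.map(x => x) // each iteration adds one link to the lineage chain
      if (i % checkpointInterval == 0) {
        rdd.cache()      // avoid recomputing the whole chain when the checkpoint is written
        rdd.checkpoint() // mark for checkpointing; it is written on the next action
        rdd.count()      // force materialization, which truncates the lineage here
      }
    }

    println(rdd.count()) // no lineage chain ever exceeds ~10 steps
    sc.stop()
  }
}

Calling cache() before checkpoint() keeps the checkpoint job from recomputing the chain, and count() is just the cheapest action to force materialization; the longer-term fix Matei suggests is to have ALS, or Spark itself, do this automatically.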