Agree that it should be fixed if possible. But why run ALS for 50 iterations? It tends to pretty much converge (to within 0.001 or so RMSE) after 5-10 iterations, and even 20 is probably overkill.

— Sent from Mailbox for iPhone
On Sun, Jan 26, 2014 at 9:59 AM, Matei Zaharia <matei.zaha...@gmail.com> wrote:

I looked into this after I opened that JIRA and it’s actually a bit harder to fix. While changing these visit() calls to use a stack manually instead of being recursive helps avoid a StackOverflowError there, you still get a StackOverflowError when you send the task to a worker node, because Java serialization uses recursion. The only real fix with the current codebase is therefore to increase your JVM stack size. Longer-term, I’d like us to automatically call checkpoint() to break lineage graphs when they exceed a certain size, which would avoid the problems in both DAGScheduler and Java serialization. We could also manually add this to ALS now without having a solution for other programs. That would be a great change to make to fix this JIRA.

Matei

On Jan 25, 2014, at 11:06 PM, Ewen Cheslack-Postava <m...@ewencp.org> wrote:

The three obvious ones in DAGScheduler.scala are in:

getParentStages
getMissingParentStages
stageDependsOn

They all follow the same pattern though (def visit(), followed by visit(root)), so they should be easy to replace with a Scala stack in place of the call stack.
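Below is a minimal sketch of the transformation Ewen describes, written against a made-up Node type rather than the real DAGScheduler code: the same traversal once as a recursive nested visit() and once with an explicit Scala stack in place of the call stack. As Matei notes above, this only removes the recursion on the scheduler side; the recursion inside Java serialization is a separate problem.

import scala.collection.mutable

object VisitSketch {
  // Toy stand-in for a node in a dependency graph; the real DAGScheduler
  // walks RDD/stage dependencies instead.
  case class Node(id: Int, deps: Seq[Node])

  // Current shape: a nested def visit() called recursively from visit(root).
  // Deep dependency chains overflow the JVM call stack.
  def collectRecursive(root: Node): Set[Int] = {
    val visited = mutable.HashSet[Int]()
    def visit(n: Node): Unit = {
      if (!visited(n.id)) {
        visited += n.id
        n.deps.foreach(visit)
      }
    }
    visit(root)
    visited.toSet
  }

  // Same traversal with an explicit Scala stack in place of the call stack,
  // so depth is limited by heap size rather than by JVM stack size.
  def collectIterative(root: Node): Set[Int] = {
    val visited = mutable.HashSet[Int]()
    val stack = mutable.Stack(root)
    while (stack.nonEmpty) {
      val n = stack.pop()
      if (!visited(n.id)) {
        visited += n.id
        n.deps.foreach(stack.push)
      }
    }
    visited.toSet
  }

  def main(args: Array[String]): Unit = {
    // Build a long chain like the one a heavily iterative job produces.
    val chain = (1 to 100000).foldLeft(Node(0, Nil))((prev, i) => Node(i, Seq(prev)))
    // collectRecursive(chain) would likely throw StackOverflowError here;
    // the iterative version handles it fine.
    println(collectIterative(chain).size)
  }
}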
Shao, Saisai January 25, 2014 at 10:52 PM

In my tests I found this phenomenon can be caused by an RDD's long dependency chain: the dependency chain is serialized into the task and sent to each executor, and deserializing the task causes the stack overflow.

This happens especially in iterative jobs, like:

var rdd = ..

for (i <- 0 to 100)
  rdd = rdd.map(x => x)

rdd = rdd.cache

Here rdd's dependencies are chained, and at some point a stack overflow will occur.

You can check
https://groups.google.com/forum/?fromgroups#!searchin/spark-users/dependency/spark-users/-Cyfe3G6VwY/PFFnslzWn6AJ
and
https://groups.google.com/forum/?fromgroups#!searchin/spark-users/dependency/spark-users/NkxcmmS-DbM/c9qvuShbHEUJ
for details. The current workaround is to cut the dependency chain by checkpointing the RDD; maybe a better way would be to clean up the dependency chain after the materialized stage has executed.

Thanks,
Jerry

-----Original Message-----
From: Reynold Xin [mailto:r...@databricks.com]
Sent: Sunday, January 26, 2014 2:04 PM
To: dev@spark.incubator.apache.org
Subject: Re: Any suggestion about JIRA 1006 "MLlib ALS gets stack overflow with too many iterations"?

I'm not entirely sure, but two candidates are:

the visit function in stageDependsOn
submitStage

Aaron Davidson January 25, 2014 at 10:01 PM

I'm an idiot, but which part of the DAGScheduler is recursive here? It seems like processEvent shouldn't have inherently recursive properties.

Reynold Xin January 25, 2014 at 9:57 PM

It seems to me that fixing DAGScheduler to make it not recursive is the better solution here, given the cost of checkpointing.

Xia, Junluan January 25, 2014 at 9:49 PM

Hi all,

The description of this bug submitted by Matei is as follows:

"The tipping point seems to be around 50. We should fix this by checkpointing the RDDs every 10-20 iterations to break the lineage chain, but checkpointing currently requires HDFS installed, which not all users will have. We might also be able to fix DAGScheduler to not be recursive."

regards,
Andrew
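To make the checkpointing workaround concrete, here is a minimal sketch of checkpointing every N iterations to break the lineage chain, applied to Jerry's toy map loop rather than to ALS itself. The master URL, checkpoint directory, and interval are placeholder assumptions; as the bug description notes, checkpointing on a real cluster generally requires HDFS.

import org.apache.spark.SparkContext

object LineageCheckpointSketch {
  def main(args: Array[String]): Unit = {
    // Local master and checkpoint path are placeholders for this sketch.
    val sc = new SparkContext("local[2]", "lineage-checkpoint-sketch")
    sc.setCheckpointDir("/tmp/spark-checkpoints") // use an HDFS path on a real cluster

    val checkpointInterval = 10 // every 10-20 iterations, per the JIRA description
    var rdd = sc.parallelize(1 to 1000000)

    for (i <- 1 to 100) {
      rdd = rdd.map(x => x) // each iteration adds one link to the lineage chain
      if (i % checkpointInterval == 0) {
        rdd.cache()      // avoid recomputing the whole chain when the checkpoint is written
        rdd.checkpoint() // mark for checkpointing; it is written on the next action
        rdd.count()      // force materialization, which truncates the lineage here
      }
    }

    println(rdd.count()) // no lineage chain ever exceeds ~10 steps
    sc.stop()
  }
}

Calling cache() before checkpoint() keeps the checkpoint job from recomputing the chain, and count() is just the cheapest action to force materialization; the longer-term fix Matei suggests is to have ALS, or Spark itself, do this automatically.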