I think "it depends" a fair bit here. That's a good default absolute convergence cutoff, although it's not crazy to want to run to further convergence since +/- 0.001 can make a difference in top-N recommendations that is noticeable, and it can seem weird that it's 'converged' while answers are non-trivially changing.
How much this matters also depends on how fast it converges, and that is
influenced by the scale of the data (versus the rank) and by lambda. 50
iterations is generally a lot, although it isn't going to give the same error
as 5 x 10-iteration runs. I'm sure Spark will need to succeed at 50 iterations
of something at some point, so that's not news. What I was hoping to propose
for ALS is a convergence criterion, since I think most cases will indeed
converge much faster. The iterations parameter could then become "max
iterations", and at least it would be harder to make the thing try to do 50+
iterations. (We are also looking to rebuild some related functionality on top,
like running N models at once: for the cost of more computation, that should
get you to a better solution in fewer iterations across all the models. That
sort of helps too.)

So far this message has not been relevant to the original issue, so my $0.02:
this problem is, I think, more likely to come up at scale, and those are
environments where people are probably running on a cluster that has HDFS. If
the checkpointing answer needs HDFS, it may be just fine to solve the problem
with checkpointing only where checkpointing is available.

--
Sean Owen | Director, Data Science | London


On Sun, Jan 26, 2014 at 8:21 AM, Nick Pentreath <nick.pentre...@gmail.com> wrote:

> Agree that it should be fixed if possible. But why run ALS for 50
> iterations? It tends to pretty much converge (to within 0.001 or so RMSE)
> after 5-10, and even 20 is probably overkill.
>
> —
> Sent from Mailbox for iPhone
>
>
> On Sun, Jan 26, 2014 at 9:59 AM, Matei Zaharia <matei.zaha...@gmail.com>
> wrote:
>
> > I looked into this after I opened that JIRA and it’s actually a bit
> > harder to fix. While changing these visit() calls to use a stack manually
> > instead of being recursive helps avoid a StackOverflowError there, you
> > still get a StackOverflowError when you send the task to a worker node,
> > because Java serialization uses recursion. The only real fix with the
> > current codebase, therefore, is to increase your JVM stack size.
> > Longer-term, I’d like us to automatically call checkpoint() to break
> > lineage graphs when they exceed a certain size, which would avoid the
> > problems in both the DAGScheduler and Java serialization. We could also
> > manually add this to ALS now without having a solution for other
> > programs. That would be a great change to make to fix this JIRA.
> >
> > Matei
> >
> > On Jan 25, 2014, at 11:06 PM, Ewen Cheslack-Postava <m...@ewencp.org>
> > wrote:
> >
> >> The three obvious ones in DAGScheduler.scala are in:
> >>
> >> getParentStages
> >> getMissingParentStages
> >> stageDependsOn
> >>
> >> They all follow the same pattern though (def visit(), followed by
> >> visit(root)), so they should be easy to replace with a Scala stack in
> >> place of the call stack.
> >>
> >>> Shao, Saisai January 25, 2014 at 10:52 PM
> >>> In my test I found this might be caused by the RDD's long dependency
> >>> chain: the chain is serialized into each task and sent to the
> >>> executors, and deserializing the task causes the stack overflow.
> >>>
> >>> Especially in an iterative job, like:
> >>>
> >>> var rdd = ...
> >>>
> >>> for (i <- 0 to 100)
> >>>   rdd = rdd.map(x => x)
> >>>
> >>> rdd = rdd.cache
> >>>
> >>> Here the RDD's dependencies get chained together, and at some point a
> >>> stack overflow will occur.
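A runnable version of Jerry's example, assuming a local SparkContext, with the periodic-checkpoint workaround discussed in this thread bolted on; the checkpoint directory, the every-20-iterations interval, and the local master are illustrative assumptions, not something the thread prescribes.

  import org.apache.spark.{SparkConf, SparkContext}

  object LineageDemo {
    def main(args: Array[String]): Unit = {
      val conf = new SparkConf().setAppName("LineageDemo").setMaster("local[2]")
      val sc = new SparkContext(conf)
      sc.setCheckpointDir("/tmp/spark-checkpoints") // an HDFS path on a real cluster

      var rdd = sc.parallelize(1 to 1000)
      for (i <- 1 to 100) {
        rdd = rdd.map(x => x)
        // Workaround: periodically checkpoint to truncate the lineage so tasks
        // no longer carry (and deserialize) a 100-deep dependency chain.
        if (i % 20 == 0) {
          rdd.cache()
          rdd.checkpoint()
          rdd.count() // force materialization so the checkpoint actually happens
        }
      }
      println(rdd.toDebugString) // shows how deep the remaining lineage is
      sc.stop()
    }
  }

Without the checkpointing branch this is essentially the snippet quoted above, and with a deep enough chain task deserialization hits the StackOverflowError that Matei describes.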
> >>>
> >>> You can check (
> >>> https://groups.google.com/forum/?fromgroups#!searchin/spark-users/dependency/spark-users/-Cyfe3G6VwY/PFFnslzWn6AJ)
> >>> and (
> >>> https://groups.google.com/forum/?fromgroups#!searchin/spark-users/dependency/spark-users/NkxcmmS-DbM/c9qvuShbHEUJ)
> >>> for details. The current workaround is to cut the dependency chain by
> >>> checkpointing the RDD; maybe a better way is to clean the dependency
> >>> chain after the materialized stage is executed.
> >>>
> >>> Thanks
> >>> Jerry
> >>>
> >>> -----Original Message-----
> >>> From: Reynold Xin [mailto:r...@databricks.com]
> >>> Sent: Sunday, January 26, 2014 2:04 PM
> >>> To: dev@spark.incubator.apache.org
> >>> Subject: Re: Any suggestion about JIRA 1006 "MLlib ALS gets stack
> >>> overflow with too many iterations"?
> >>>
> >>> I'm not entirely sure, but two candidates are
> >>>
> >>> the visit function in stageDependsOn
> >>>
> >>> submitStage
> >>>
> >>>
> >>> Aaron Davidson January 25, 2014 at 10:01 PM
> >>> I'm an idiot, but which part of the DAGScheduler is recursive here?
> >>> Seems like processEvent shouldn't have inherently recursive properties.
> >>>
> >>>
> >>> Reynold Xin January 25, 2014 at 9:57 PM
> >>> It seems to me that fixing the DAGScheduler to not be recursive is the
> >>> better solution here, given the cost of checkpointing.
> >>>
> >>>
> >>> Xia, Junluan January 25, 2014 at 9:49 PM
> >>> Hi all
> >>>
> >>> The description of this bug submitted by Matei is as follows:
> >>>
> >>> The tipping point seems to be around 50. We should fix this by
> >>> checkpointing the RDDs every 10-20 iterations to break the lineage
> >>> chain, but checkpointing currently requires HDFS installed, which not
> >>> all users will have.
> >>>
> >>> We might also be able to fix DAGScheduler to not be recursive.
> >>>
> >>> regards,
> >>> Andrew
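On the DAGScheduler side, Ewen's suggestion of swapping the call stack for an explicit Scala stack looks roughly like the sketch below. This is an illustrative pattern only, not the actual DAGScheduler code, and, per Matei's point, it does not by itself address the recursion inside Java serialization.

  import scala.collection.mutable

  // Stand-in for a stage/RDD graph node; the real code walks Stage and RDD objects.
  case class Node(id: Int, parents: Seq[Node])

  // The recursive "def visit(n) { ...; n.parents.foreach(visit) }" pattern,
  // rewritten with an explicit stack so deep graphs cannot overflow the JVM
  // call stack.
  def collectAncestors(root: Node): Set[Int] = {
    val visited = mutable.HashSet[Int]()
    val stack = mutable.Stack[Node](root)
    while (stack.nonEmpty) {
      val node = stack.pop()
      if (!visited(node.id)) {
        visited += node.id
        node.parents.foreach(p => stack.push(p))
      }
    }
    visited.toSet
  }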