I think "it depends" a fair bit here. That's a good default absolute convergence cutoff, although it's not crazy to want to run to further convergence since +/- 0.001 can make a difference in top-N recommendations that is noticeable, and it can seem weird that it's 'converged' while answers are non-trivially changing.
How much this matters also depends on how fast it converges, and that is
influenced by the scale of the data (versus the rank) and by lambda. 50
iterations is generally a lot, although it isn't going to give the same error
as 5 x 10-iteration runs. I'm sure Spark will need to succeed at 50 iterations
of something at some point, so that's not news. What I was hoping to propose
for ALS is a convergence criterion, since I think most cases will indeed
converge much faster. The iterations parameter could then become "max
iterations", and at least it would be harder to make the thing try to do 50+
iterations. (We are also looking to rebuild some related functionality on top,
like running N models at once: for the cost of more computation, that should
get you to a better solution in fewer iterations across all the models. That
sort of helps too.)

So far this message has not been relevant to the original issue, so my $0.02:
this problem is, I think, more likely to come up at scale, and those are
environments where people are probably running on a cluster that has HDFS. If
the checkpointing answer needs HDFS, it may be just fine to solve the problem
with checkpointing only where checkpointing is available.

--
Sean Owen | Director, Data Science | London


On Sun, Jan 26, 2014 at 8:21 AM, Nick Pentreath <nick.pentre...@gmail.com> wrote:

> Agree that it should be fixed if possible. But why run ALS for 50
> iterations? It tends to pretty much converge (to within 0.001 or so RMSE)
> after 5-10, and even 20 is probably overkill.
>
> —
> Sent from Mailbox for iPhone
>
>
> On Sun, Jan 26, 2014 at 9:59 AM, Matei Zaharia <matei.zaha...@gmail.com>
> wrote:
>
> > I looked into this after I opened that JIRA and it’s actually a bit
> > harder to fix. While changing these visit() calls to use a stack manually
> > instead of being recursive helps avoid a StackOverflowError there, you
> > still get a StackOverflowError when you send the task to a worker node,
> > because Java serialization uses recursion. The only real fix with the
> > current codebase, therefore, is to increase your JVM stack size.
> > Longer-term, I’d like us to automatically call checkpoint() to break
> > lineage graphs when they exceed a certain size, which would avoid the
> > problems in both the DAGScheduler and Java serialization. We could also
> > manually add this to ALS now without having a solution for other
> > programs. That would be a great change to make to fix this JIRA.
> >
> > Matei
> >
> > On Jan 25, 2014, at 11:06 PM, Ewen Cheslack-Postava <m...@ewencp.org>
> > wrote:
> >
> >> The three obvious ones in DAGScheduler.scala are in:
> >>
> >> getParentStages
> >> getMissingParentStages
> >> stageDependsOn
> >>
> >> They all follow the same pattern though (def visit(), followed by
> >> visit(root)), so they should be easy to replace with a Scala stack in
> >> place of the call stack.
> >>
> >>> Shao, Saisai January 25, 2014 at 10:52 PM
> >>> In my test I found this might be caused by the RDD's long dependency
> >>> chain: the chain is serialized into each task and sent to the
> >>> executors, and deserializing the task causes the stack overflow.
> >>>
> >>> Especially in an iterative job, like:
> >>>
> >>> var rdd = ...
> >>>
> >>> for (i <- 0 to 100)
> >>>   rdd = rdd.map(x => x)
> >>>
> >>> rdd = rdd.cache
> >>>
> >>> Here the RDD's dependencies get chained together, and at some point a
> >>> stack overflow will occur.
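A runnable version of Jerry's example, assuming a local SparkContext, with the periodic-checkpoint workaround discussed in this thread bolted on; the checkpoint directory, the every-20-iterations interval, and the local master are illustrative assumptions, not something the thread prescribes.

  import org.apache.spark.{SparkConf, SparkContext}

  object LineageDemo {
    def main(args: Array[String]): Unit = {
      val conf = new SparkConf().setAppName("LineageDemo").setMaster("local[2]")
      val sc = new SparkContext(conf)
      sc.setCheckpointDir("/tmp/spark-checkpoints") // an HDFS path on a real cluster

      var rdd = sc.parallelize(1 to 1000)
      for (i <- 1 to 100) {
        rdd = rdd.map(x => x)
        // Workaround: periodically checkpoint to truncate the lineage so tasks
        // no longer carry (and deserialize) a 100-deep dependency chain.
        if (i % 20 == 0) {
          rdd.cache()
          rdd.checkpoint()
          rdd.count() // force materialization so the checkpoint actually happens
        }
      }
      println(rdd.toDebugString) // shows how deep the remaining lineage is
      sc.stop()
    }
  }

Without the checkpointing branch this is essentially the snippet quoted above, and with a deep enough chain task deserialization hits the StackOverflowError that Matei describes.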
> >>>
> >>> You can check (
> >>> https://groups.google.com/forum/?fromgroups#!searchin/spark-users/dependency/spark-users/-Cyfe3G6VwY/PFFnslzWn6AJ)
> >>> and (
> >>> https://groups.google.com/forum/?fromgroups#!searchin/spark-users/dependency/spark-users/NkxcmmS-DbM/c9qvuShbHEUJ)
> >>> for details. The current workaround is to cut the dependency chain by
> >>> checkpointing the RDD; maybe a better way is to clean the dependency
> >>> chain after the materialized stage is executed.
> >>>
> >>> Thanks
> >>> Jerry
> >>>
> >>> -----Original Message-----
> >>> From: Reynold Xin [mailto:r...@databricks.com]
> >>> Sent: Sunday, January 26, 2014 2:04 PM
> >>> To: dev@spark.incubator.apache.org
> >>> Subject: Re: Any suggestion about JIRA 1006 "MLlib ALS gets stack
> >>> overflow with too many iterations"?
> >>>
> >>> I'm not entirely sure, but two candidates are
> >>>
> >>> the visit function in stageDependsOn
> >>>
> >>> submitStage
> >>>
> >>>
> >>> Aaron Davidson January 25, 2014 at 10:01 PM
> >>> I'm an idiot, but which part of the DAGScheduler is recursive here?
> >>> Seems like processEvent shouldn't have inherently recursive properties.
> >>>
> >>>
> >>> Reynold Xin January 25, 2014 at 9:57 PM
> >>> It seems to me that fixing the DAGScheduler to not be recursive is the
> >>> better solution here, given the cost of checkpointing.
> >>>
> >>>
> >>> Xia, Junluan January 25, 2014 at 9:49 PM
> >>> Hi all
> >>>
> >>> The description of this bug submitted by Matei is as follows:
> >>>
> >>> The tipping point seems to be around 50. We should fix this by
> >>> checkpointing the RDDs every 10-20 iterations to break the lineage
> >>> chain, but checkpointing currently requires HDFS installed, which not
> >>> all users will have.
> >>>
> >>> We might also be able to fix DAGScheduler to not be recursive.
> >>>
> >>> regards,
> >>> Andrew
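On the DAGScheduler side, Ewen's suggestion of swapping the call stack for an explicit Scala stack looks roughly like the sketch below. This is an illustrative pattern only, not the actual DAGScheduler code, and, per Matei's point, it does not by itself address the recursion inside Java serialization.

  import scala.collection.mutable

  // Stand-in for a stage/RDD graph node; the real code walks Stage and RDD objects.
  case class Node(id: Int, parents: Seq[Node])

  // The recursive "def visit(n) { ...; n.parents.foreach(visit) }" pattern,
  // rewritten with an explicit stack so deep graphs cannot overflow the JVM
  // call stack.
  def collectAncestors(root: Node): Set[Int] = {
    val visited = mutable.HashSet[Int]()
    val stack = mutable.Stack[Node](root)
    while (stack.nonEmpty) {
      val node = stack.pop()
      if (!visited(node.id)) {
        visited += node.id
        node.parents.foreach(p => stack.push(p))
      }
    }
    visited.toSet
  }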