Sorry, not persist... I meant adding a user parameter k that checkpoints
after every k iterations out of the N ALS iterations... We have HDFS
installed, so that's not a big deal... Is there any issue with adding this
user parameter in ALS.scala? If there is, I can add it to our internal
branch...
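
Roughly something like this (just a sketch, assuming a new
checkpointInterval parameter and sc.setCheckpointDir pointing at HDFS; per
Xiangrui's note below, checkpoint() alone is lazy, so an action like
count() is needed to actually cut the lineage):

for (iter <- 1 to iterations) {
  products = updateFeatures(users, userOutLinks, productInLinks,
    partitioner, rank, lambda, alpha, YtY = None)
  users = updateFeatures(products, productOutLinks, userInLinks,
    partitioner, rank, lambda, alpha, YtY = None)
  if (iter % checkpointInterval == 0) {
    users.checkpoint() // marks the RDD; checkpointing runs on the next action
    users.count()      // forces a job so the lineage chain is truncated
  }
}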

For me the tipping point for k seems to be 4... With 4 iterations I can
write out the factors... If I run with 10 iterations, after 4 I can see
that it restarts the sparse matrix partition, tries to run all the
iterations over again, and fails with an array index out of bounds, which
does not seem like a real bug...

Not sure if it can be reproduced on MovieLens, as the dataset I have is
25M x 3M (and counting)... while MovieLens is tall and thin...

Another idea would be to give an option to restart ALS with previous
factors... That way the ALS core algorithm does not need to change, and it
might be more useful... We could point to a location from which the old
factors can be loaded... I think @sean used a similar idea in Oryx
generations...
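
Again just a sketch (saveAsObjectFile/objectFile are the plain RDD APIs;
the hook that would let ALS accept initial factors instead of a random
init is hypothetical, as is the HDFS path):

// after a successful short run, save the factors somewhere durable
users.saveAsObjectFile("hdfs:///tmp/als/userFactors")
products.saveAsObjectFile("hdfs:///tmp/als/productFactors")

// later, in a fresh job, reload them and seed ALS with them
val oldUsers =
  sc.objectFile[(Int, Array[Double])]("hdfs:///tmp/als/userFactors")
val oldProducts =
  sc.objectFile[(Int, Array[Double])]("hdfs:///tmp/als/productFactors")
// ALS would then start its iterations from oldUsers/oldProducts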

Let me know which way you guys prefer....I can add it in...




On Sun, Apr 6, 2014 at 9:15 PM, Xiangrui Meng <men...@gmail.com> wrote:

> Btw, explicit ALS doesn't need persist because each intermediate
> factor is only used once. -Xiangrui
>
> On Sun, Apr 6, 2014 at 9:13 PM, Xiangrui Meng <men...@gmail.com> wrote:
> > The persist used in implicit ALS doesn't help the StackOverflow problem.
> > Persist doesn't cut lineage. We need to call count() and then
> > checkpoint() to cut the lineage. Did you try the workaround mentioned
> > in https://issues.apache.org/jira/browse/SPARK-958:
> >
> > "I tune JVM thread stack size to 512k via option -Xss512k and it works."
> >
> > Best,
> > Xiangrui
> >
> > On Sun, Apr 6, 2014 at 10:21 AM, Debasish Das <debasish.da...@gmail.com> wrote:
> >> At HEAD I see a persist option in implicitPrefs, but given more cases
> >> like the ones mentioned above, why don't we use a similar technique in
> >> explicit runs as well and take an input for which iterations we should
> >> persist?
> >>
> >> for (iter <- 1 to iterations) {
> >>   // perform ALS update
> >>   logInfo("Re-computing I given U (Iteration %d/%d)".format(iter, iterations))
> >>   products = updateFeatures(users, userOutLinks, productInLinks,
> >>     partitioner, rank, lambda, alpha, YtY = None)
> >>   logInfo("Re-computing U given I (Iteration %d/%d)".format(iter, iterations))
> >>   users = updateFeatures(products, productOutLinks, userInLinks,
> >>     partitioner, rank, lambda, alpha, YtY = None)
> >> }
> >>
> >> Say I want to persist at every k iterations out of N iterations of
> >> explicit ALS; there should be an option to do that... Implicit right
> >> now persists at each iteration...
> >>
> >> Does this option make sense, or do you guys want this issue to be
> >> fixed in a different way?...
> >>
> >> I definitely see that for my 25M x 3M run with 64 GB executor memory,
> >> something is going wrong after the 5th iteration, and I wanted to run
> >> for 10 iterations...
> >>
> >> So my k is 4/5 for this particular problem...
> >>
> >> I can open the PR after testing the fix on the dataset I have... I
> >> will also try to see if we can make such datasets public for more
> >> research...
> >>
> >> For the LDA problem mentioned earlier in this email chain, k is
> >> 10... NMF can generate topics similar to LDA as well... the Carrot2
> >> project uses it...
> >>
> >>
> >>
> >> On Thu, Mar 27, 2014 at 3:20 PM, Debasish Das <debasish.da...@gmail.com> wrote:
> >>
> >>> Hi Matei,
> >>>
> >>> I am hitting similar problems with 10 ALS iterations... I am running
> >>> with 24 GB executor memory on 10 nodes for a 20M x 3M matrix with
> >>> rank = 50
> >>>
> >>> The first iteration of flatMaps runs fine, which means that the
> >>> per-iteration memory requirements are fine...
> >>>
> >>> If I checkpoint the RDD, most likely the remaining 9 iterations will
> >>> also run fine and I will get the results...
> >>>
> >>> Is there a plan to add a checkpoint option to ALS for such large
> >>> factorization jobs?
> >>>
> >>> Thanks.
> >>> Deb
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> On Tue, Jan 28, 2014 at 11:10 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
> >>>
> >>>> That would be great to add. Right now it would be easy to change it
> >>>> to use another Hadoop FileSystem implementation at the very least (I
> >>>> think you can just pass the URL for that), but for Cassandra you'd
> >>>> have to use a different InputFormat or some direct Cassandra access
> >>>> API.
> >>>>
> >>>> Matei
> >>>>
> >>>> On Jan 28, 2014, at 5:02 PM, Evan Chan <e...@ooyala.com> wrote:
> >>>>
> >>>> > By the way, is there any plan to make a pluggable backend for
> >>>> > checkpointing? We might be interested in writing, for example, a
> >>>> > Cassandra backend.
> >>>> >
> >>>> > On Sat, Jan 25, 2014 at 9:49 PM, Xia, Junluan <junluan....@intel.com> wrote:
> >>>> >> Hi all
> >>>> >>
> >>>> >> The description of this bug, submitted by Matei, is as follows:
> >>>> >>
> >>>> >>
> >>>> >> The tipping point seems to be around 50. We should fix this by
> >>>> >> checkpointing the RDDs every 10-20 iterations to break the lineage
> >>>> >> chain, but checkpointing currently requires HDFS installed, which
> >>>> >> not all users will have.
> >>>> >>
> >>>> >> We might also be able to fix DAGScheduler to not be recursive.
> >>>> >>
> >>>> >>
> >>>> >> regards,
> >>>> >> Andrew
> >>>> >>
> >>>> >
> >>>> >
> >>>> >
> >>>> > --
> >>>> > --
> >>>> > Evan Chan
> >>>> > Staff Engineer
> >>>> > e...@ooyala.com  |
> >>>>
> >>>>
> >>>
>
