Thank you, Nathan.

On Wed, Nov 30, 2011 at 9:49 AM, Nathan Halko <[email protected]> wrote:
> Yes I will time the phases. My largest dataset is only a couple of gigs
> currently, I ran into the 5G limit on Amazon S3 and need to find a
> workaround. But I figured that might be large enough to see scaling using the
> small instances but maybe not. I will work on these issues and see what
> happens, thanks for your help Dmitriy.
>
> Nathan
>
> On Tue, Nov 29, 2011 at 3:24 PM, Dmitriy Lyubimov <[email protected]> wrote:
>
> > ok thanks. I will file an issue for default p.
> >
> > also i updated the docs re: --reduceTasks.
> >
> > it would be nice if you could log time for map and reduce phases for
> > all tasks (it is reported in MR web ui at namenode:50030 by default)
> > in each case if you think there's a performance issue. It would at
> > least allow us to narrow any problem to a particular part of the computation.
> > My datasets are too small, ~10G, and i run them for a rather small k;
> > at that size i don't see any visible irregularities.
> >
> > Thanks.
> > -Dmitriy
> >
> > On Tue, Nov 29, 2011 at 2:12 PM, Nathan Halko <[email protected]> wrote:
> > > Thanks for the heads up with numReduceTasks. I haven't changed the
> > > parameters much from the default yet, so this is probably my problem.
> > >
> > > By slave I mean machine, I'm running an m1.small as master and either
> > > m1.small's or m1.large's as slaves (datanode, tasktracker, child).
> > >
> > > p depends mostly on the decay of singular values rather than the rank k.
> > > In fact (in the analysis at least) it is completely independent of k. The
> > > quantity of interest is sig_k/sig_(k+p) (signal to noise ratio); this should
> > > be large. Ideally we would set p as a function of this parameter, which is
> > > dependent on the matrix (and unknown until we have already solved the
> > > problem :-) ). I suggest 25 since for example tf-idf matrices have a low
> > > sig/noise ratio. You could probably use less for some cases; if you need p
> > > to be larger you probably need a power iteration, so it seems to be a good
> > > default point. Also the parameter is not an initial point of optimization,
> > > so to err on the larger side is fine. After all, Lanczos method suggests
> > > that only 1/3 of singular triplets are accurate, which corresponds to p=2k,
> > > which is very large. Basically, the result is insensitive to the exact value
> > > of p so long as it is large 'enough'.
> > >
> > > On Tue, Nov 29, 2011 at 12:32 PM, Dmitriy Lyubimov <[email protected]> wrote:
> > >
> > >> PPS also make sure you specify numReduceTasks. Default is, I believe, 1,
> > >> which will not scale at multiplication steps at all.
> > >>
> > >> On Tue, Nov 29, 2011 at 10:15 AM, Dmitriy Lyubimov <[email protected]> wrote:
> > >> > PS actually i think it should scale horizontally a little better than
> > >> > vertically but that's just a guess.
> > >> >
> > >> > On Tue, Nov 29, 2011 at 10:10 AM, Dmitriy Lyubimov <[email protected]> wrote:
> > >> >> On Tue, Nov 29, 2011 at 9:56 AM, Nathan Halko <[email protected]> wrote:
> > >> >>>
> > >> >>> The docs look great Dmitriy. Has anyone considered giving oversampling
> > >> >>> ssvd over lanczos which is promising. Trying to scale out horizontally but
> > >> >>> not seeing any difference between using one slave or many slaves. Any
> > >> >>> ideas? (I won't go into detail about the setup here but if sounds familiar
> > >> >>> I'd like to talk more).
> > >> >>
> > >> >> What do you mean by a slave? a mapper? a machine?
> > >> >>
> > >> >> whether you increase input horizontally or vertically, you should see
> > >> >> more mappers. If your cluster has enough capacity to schedule all
> > >> >> mappers right away, i believe you will get almost the same time (i.e.
> > >> >> almost linear scaling) for most of the jobs.
> > >> >>
> > >> >>> The basic problem with lanczos in the distributed
> > >> >>> environment seems to be that a matrix-vector multiply is not enough work to
> > >> >>> offset any setup costs, also there is not a distributed orthogonalization
> > >> >>> with lanczos and I'm getting OOM's making it difficult to scale. I would
> > >> >>> still like to contribute what results I have found but I'm short on time so
> > >> >>> nothing besides work directly related to the completion of my thesis will
> > >> >>> happen until that is done.
> > >> >>>
> > >> >>> On Fri, Nov 25, 2011 at 5:37 PM, Dmitriy Lyubimov <[email protected]> wrote:
> > >> >>>
> > >> >>> > I attached the latex source as well (lyx, actually). I would've used
> > >> >>> > Wiki if it supported mathjax. So anyone can modify the usage if need
> > >> >>> > be. (Anyone who has lyx anyway).
> > >> >>> >
> > >> >>> > Dev docs were attached to several jira issues (and i had blog
> > >> >>> > entries), if you want more recent copies of them moved over
> > >> >>> > to wiki, i'd be happy to. Mainly, so far there are 2 working notes,
> > >> >>> > one for original method, and another for power iterations, attached to
> > >> >>> > corresponding jiras.
> > >> >>> >
> > >> >>> > On Fri, Nov 25, 2011 at 4:26 PM, Grant Ingersoll <[email protected]> wrote:
> > >> >>> > > I hooked it into the Algorithms page.
> > >> >>> > >
> > >> >>> > > How do you intend to keep the PDF up to date? I like the focus more on
> > >> >>> > > the user, but it would also be good to have some dev docs.
> > >> >>> > >
> > >> >>> > > Also, with both Lanczos and this it would be good if we could hook them
> > >> >>> > > into some real examples.
> > >> >>> > >
> > >> >>> > > On Nov 25, 2011, at 5:42 PM, Dmitriy Lyubimov wrote:
> > >> >>> > >
> > >> >>> > >> Hi,
> > >> >>> > >>
> > >> >>> > >> I put a usage and overview doc for SSVD onto wiki. I'd appreciate if
> > >> >>> > >> somebody else could look thru it, to scan for completeness and
> > >> >>> > >> suggestions.
> > >> >>> > >>
> > >> >>> > >> I tried to approach it as a user-facing documentation, i.e. I tried to
> > >> >>> > >> avoid discussing any implementation specifics.
> > >> >>> > >>
> > >> >>> > >> I had several users and Nathan Halko trying it out and actually
> > >> >>> > >> favorably commenting on its scalability vs. Lanczos but i don't know
> > >> >>> > >> first hand of any production use (even our own use is fairly limited
> > >> >>> > >> in terms of input volume we ever processed, and has actually somewhat
> > >> >>> > >> diverged from this Mahout implementation). Perhaps putting it more in
> > >> >>> > >> front of users will help to receive more feedback.
> > >> >>> > >>
> > >> >>> > >> Thanks.
> > >> >>> > >> -Dmitriy
> > >> >>> > >
> > >> >>> > > --------------------------------------------
> > >> >>> > > Grant Ingersoll
> > >> >>> > > http://www.lucidimagination.com
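
Note on the oversampling discussion in the thread: the point that p is governed by the decay of the singular values, through the ratio sig_k/sig_(k+p), can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the two spectra are made up (one decaying geometrically, one decaying slowly as a stand-in for a low signal-to-noise matrix such as tf-idf), the target ratio of 2.5 is arbitrary, and the helper names gap_ratio and smallest_p are hypothetical and not part of Mahout.

```python
def gap_ratio(sigmas, k, p):
    """Return sig_k / sig_(k+p) for a descending spectrum (1-based k)."""
    return sigmas[k - 1] / sigmas[k + p - 1]


def smallest_p(sigmas, k, target):
    """Smallest p with sig_k / sig_(k+p) >= target, or None if never reached."""
    for p in range(1, len(sigmas) - k + 1):
        if gap_ratio(sigmas, k, p) >= target:
            return p
    return None


# Two made-up spectra of 200 singular values (illustrative assumptions only).
fast = [0.8 ** i for i in range(200)]              # geometric decay
slow = [1.0 / (i + 1) ** 0.5 for i in range(200)]  # slow power-law decay

k = 10
target = 2.5  # arbitrary "large enough" ratio, chosen for the example

print("fast decay: p =", smallest_p(fast, k, target))
print("slow decay: p =", smallest_p(slow, k, target))
```

With these assumed spectra the slowly decaying one needs a much larger p to reach the same ratio, which is in line with the suggestion in the thread to err on the larger side for the default.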
