Yes, I will time the phases.  My largest dataset is only a couple of gigs at
the moment; I ran into the 5G limit on Amazon S3 and need to find a
workaround.  I figured that might be large enough to see scaling using the
small instances, but maybe not.  I will work on these issues and see what
happens. Thanks for your help, Dmitriy.
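
As an aside: the usual workaround for S3's 5 GB single-PUT limit is multipart
upload, which the AWS SDK for Java's TransferManager performs automatically
for large files. A minimal sketch; the bucket, key, file names and credential
handling are purely illustrative:

    import java.io.File;
    import com.amazonaws.auth.BasicAWSCredentials;
    import com.amazonaws.services.s3.transfer.TransferManager;
    import com.amazonaws.services.s3.transfer.Upload;

    public class BigUpload {
      public static void main(String[] args) throws Exception {
        // TransferManager splits large files into parts and uploads them in
        // parallel, which sidesteps the 5 GB limit on a single PUT.
        TransferManager tm = new TransferManager(
            new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY")); // placeholders
        Upload up = tm.upload("my-bucket", "data/matrix.seq",     // illustrative
            new File("/data/matrix.seq"));                        // names
        up.waitForCompletion();  // blocks until all parts have been uploaded
        tm.shutdownNow();        // releases the transfer threads
      }
    }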

Nathan

On Tue, Nov 29, 2011 at 3:24 PM, Dmitriy Lyubimov <[email protected]> wrote:

> OK, thanks. I will file an issue for default p.
>
> Also, I updated the docs re: --reduceTasks.
>
> it would be nice if you could log the time for the map and reduce phases of
> all tasks (it is reported in the MR web UI at namenode:50030 by default)
> in each case where you think there's a performance issue.  That would at
> least let us narrow any problem down to a particular part of the computation.
> My datasets are too small (~10G), and I run them with a rather small k;
> at that size I don't see any visible irregularities.
>
> Thanks.
> -Dmitriy
>
> On Tue, Nov 29, 2011 at 2:12 PM, Nathan Halko <[email protected]>
> wrote:
> > Thanks for the heads up about numReduceTasks.  I haven't changed the
> > parameters much from the defaults yet, so this is probably my problem.
> >
> > By slave I mean a machine: I'm running an m1.small as the master and either
> > m1.small or m1.large instances as slaves (datanode, tasktracker, child).
> >
> > p depends mostly on the decay of the singular values rather than on the
> > rank k.  In fact (in the analysis, at least) it is completely independent
> > of k.  The quantity of interest is sig_k/sig_{k+p} (a signal-to-noise
> > ratio), which should be large.  Ideally we would set p as a function of
> > this quantity, but it depends on the matrix and is unknown until we have
> > already solved the problem  :-) .  I suggest 25 because, for example,
> > tf-idf matrices have a low signal-to-noise ratio.  You could probably use
> > less in some cases; if you need p to be much larger, you probably need a
> > power iteration instead, so 25 seems like a good default.  Also, the
> > parameter is not something to optimize tightly, so erring on the larger
> > side is fine.  After all, the Lanczos method suggests that only about 1/3
> > of the computed singular triplets are accurate, which corresponds to
> > p = 2k, which is very large.  Basically, the result is insensitive to the
> > exact value of p so long as it is large 'enough'.
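
In formula form, the criterion described above (sigma_i are the singular
values of the input matrix; this only restates the paragraph):

    % Oversampling p should separate the k-th singular value from the
    % (k+p)-th; the "signal to noise" ratio should be large:
    \[
      \frac{\sigma_k}{\sigma_{k+p}} \gg 1 .
    \]
    % The Lanczos rule of thumb that only about a third of the computed
    % triplets are accurate amounts to computing 3k values for rank k:
    \[
      k + p = 3k \quad\Longrightarrow\quad p = 2k .
    \]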
> >
> > On Tue, Nov 29, 2011 at 12:32 PM, Dmitriy Lyubimov <[email protected]>
> > wrote:
> >
> >> PPS: also make sure you specify numReduceTasks.  The default is, I
> >> believe, 1, which will not scale at the multiplication steps at all.
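
For reference, the same knob appears programmatically as the reduceTasks
argument of SSVDSolver. A rough sketch follows; the constructor shape is
recalled from the 0.6-era sources and every value and path is illustrative,
so verify it against the SSVDSolver in your Mahout version:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.mahout.math.hadoop.stochasticsvd.SSVDSolver;

    public class RunSsvd {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        int k = 100;            // requested rank (example value)
        int p = 25;             // oversampling, per the discussion above
        int ablockRows = 30000; // A-block height (example value)
        int reduceTasks = 16;   // raise this: the default of 1 will not scale

        // Constructor arguments as recalled from the 0.6-era code; check the
        // signature in your release before relying on it.
        SSVDSolver solver = new SSVDSolver(conf,
            new Path[] { new Path("/user/me/A") },  // illustrative input path
            new Path("/user/me/ssvd-out"),          // illustrative output path
            ablockRows, k, p, reduceTasks);
        solver.setQ(1);  // one power iteration, if available in your version
        solver.run();
      }
    }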
> >>
> >> On Tue, Nov 29, 2011 at 10:15 AM, Dmitriy Lyubimov <[email protected]>
> >> wrote:
> >> > PS actually I think it should scale horizontally a little better than
> >> > vertically, but that's just a guess.
> >> >
> >> > On Tue, Nov 29, 2011 at 10:10 AM, Dmitriy Lyubimov <[email protected]>
> >> > wrote:
> >> >> On Tue, Nov 29, 2011 at 9:56 AM, Nathan Halko <[email protected]>
> >> >> wrote:
> >> >>>
> >> >>> The docs look great, Dmitriy.  Has anyone considered giving the
> >> >>> oversampling a default value?  I've been trying ssvd over lanczos,
> >> >>> which is promising.  I'm trying to scale out horizontally but not
> >> >>> seeing any difference between using one slave or many slaves.  Any
> >> >>> ideas? (I won't go into detail about the setup here, but if this
> >> >>> sounds familiar I'd like to talk more.)
> >> >>
> >> >> What do you mean by a slave? a mapper? a machine?
> >> >>
> >> >> Whether you increase the input horizontally or vertically, you should
> >> >> see more mappers.  If your cluster has enough capacity to schedule all
> >> >> the mappers right away, I believe you will get almost the same time
> >> >> (i.e. almost linear scaling) for most of the jobs.
> >> >>
> >> >>> The basic problem with Lanczos in the distributed environment seems
> >> >>> to be that a matrix-vector multiply is not enough work to offset the
> >> >>> setup costs; also, there is no distributed orthogonalization with
> >> >>> Lanczos, and I'm getting OOMs, which makes it difficult to scale.  I
> >> >>> would still like to contribute the results I have found, but I'm
> >> >>> short on time, so nothing besides work directly related to the
> >> >>> completion of my thesis will happen until that is done.
> >> >>>
> >> >>
> >> >>> On Fri, Nov 25, 2011 at 5:37 PM, Dmitriy Lyubimov <[email protected]>
> >> >>> wrote:
> >> >>>
> >> >>> > I attached the LaTeX source as well (LyX, actually). I would've
> >> >>> > used the wiki if it supported MathJax. So anyone can modify the
> >> >>> > usage doc if need be (anyone who has LyX, anyway).
> >> >>> >
> >> >>> > Dev docs were attached to several JIRA issues (and I had blog
> >> >>> > entries); if you want more recent copies of them moved over to the
> >> >>> > wiki, I'd be happy to do that. Mainly, so far there are 2 working
> >> >>> > notes, one for the original method and another for power
> >> >>> > iterations, attached to the corresponding JIRAs.
> >> >>> >
> >> >>> >
> >> >>> > On Fri, Nov 25, 2011 at 4:26 PM, Grant Ingersoll <[email protected]>
> >> >>> > wrote:
> >> >>> > > I hooked it into the Algorithms page.
> >> >>> > >
> >> >>> > > How do you intend to keep the PDF up to date?  I like the focus
> >> >>> > > more on the user, but it would also be good to have some dev docs.
> >> >>> > >
> >> >>> > > Also, with both Lanczos and this, it would be good if we could
> >> >>> > > hook them into some real examples.
> >> >>> > >
> >> >>> > > On Nov 25, 2011, at 5:42 PM, Dmitriy Lyubimov wrote:
> >> >>> > >
> >> >>> > >> Hi,
> >> >>> > >>
> >> >>> > >> I put a usage and overview doc for SSVD onto the wiki. I'd
> >> >>> > >> appreciate it if somebody else could look through it to check
> >> >>> > >> for completeness and offer suggestions.
> >> >>> > >>
> >> >>> > >> I tried to approach it as user-facing documentation, i.e. I
> >> >>> > >> tried to avoid discussing any implementation specifics.
> >> >>> > >>
> >> >>> > >> I had several users and Nathan Halko trying it out and actually
> >> >>> > >> commenting favorably on its scalability vs. Lanczos, but I don't
> >> >>> > >> know firsthand of any production use (even our own use is fairly
> >> >>> > >> limited in terms of the input volume we ever processed, and has
> >> >>> > >> actually somewhat diverged from this Mahout implementation).
> >> >>> > >> Perhaps putting it more in front of users will help us receive
> >> >>> > >> more feedback.
> >> >>> > >>
> >> >>> > >> Thanks.
> >> >>> > >> -Dmitriy
> >> >>> > >
> >> >>> > > --------------------------------------------
> >> >>> > > Grant Ingersoll
> >> >>> > > http://www.lucidimagination.com
> >> >>> > >
> >> >>> > >
> >> >>> > >
> >> >>> > >
> >> >>> >
> >>
>
