Thanks for the heads up about numReduceTasks.  I haven't changed the
parameters much from the defaults yet, so this is probably my problem.
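
If I understand the mechanism right, this is the generic Hadoop knob in play
(just a sketch on my end: the job name and the count of 8 are made up, and how
the SSVD driver actually exposes the option may differ, so I'll check its help
output):

    // Sketch only: the generic Hadoop way of raising the reducer count.
    // The job name and the count of 8 are placeholders; how Mahout's SSVD
    // driver exposes this option may differ from plain Job setup.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class ReduceTasksSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "ssvd-step-sketch");
        // With the default of 1 reduce task, the multiplication steps funnel
        // through a single reducer and cannot scale out.
        job.setNumReduceTasks(8);
        // ... mapper/reducer classes and input/output paths would go here ...
      }
    }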

By slave I mean a machine: I'm running an m1.small as the master and either
m1.smalls or m1.larges as slaves (datanode, tasktracker, child).

p depends mostly on the decay of the singular values rather than on the rank
k.  In fact (in the analysis, at least) it is completely independent of k.
The quantity of interest is sigma_k / sigma_(k+p) (a signal-to-noise ratio),
which should be large.  Ideally we would set p as a function of this quantity,
but it depends on the matrix and is unknown until we have already solved the
problem :-).  I suggest 25 since, for example, tf-idf matrices have a low
signal-to-noise ratio.  In some cases you could probably use less; if you need
p to be much larger, you probably need a power iteration instead, so 25 seems
to be a good default.  Also, the parameter is not a starting point for
optimization, so erring on the larger side is fine.  After all, the Lanczos
method suggests that only about 1/3 of the singular triplets are accurate,
which corresponds to p = 2k, which is very large.  Basically, the exact value
of p doesn't matter much so long as it is large 'enough'.
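
As a toy illustration (standalone code, nothing to do with the Mahout
implementation; the geometric spectrum sigma_i = decay^i and the decay rates
below are just made-up examples), the ratio sigma_k / sigma_(k+p) works out to
(1/decay)^p regardless of k, so a slowly decaying spectrum barely clears 1
even at p = 25, while a quickly decaying one gets a large ratio from a small
p:

    // Toy illustration only (not Mahout code): geometric spectra
    // sigma_i = decay^i with made-up decay rates.  The "signal to noise"
    // ratio sigma_k / sigma_(k+p) = (1/decay)^p is independent of k and is
    // governed by the decay rate: slowly decaying (tf-idf-like) spectra need
    // a larger p, or a power iteration, to push it appreciably above 1.
    public class OversamplingSketch {
      public static void main(String[] args) {
        int[] ps = {5, 15, 25};
        double[] decays = {0.99, 0.95, 0.80}; // slow ... fast singular value decay
        for (double decay : decays) {
          for (int p : ps) {
            double ratio = Math.pow(1.0 / decay, p); // sigma_k / sigma_(k+p) for any k
            System.out.printf("decay=%.2f  p=%d  sigma_k/sigma_(k+p)=%.2f%n",
                decay, p, ratio);
          }
        }
      }
    }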

On Tue, Nov 29, 2011 at 12:32 PM, Dmitriy Lyubimov <[email protected]> wrote:

> PPS also make sure you specify numReduceTasks. The default is 1, I believe,
> which will not scale at the multiplication steps at all.
>
> On Tue, Nov 29, 2011 at 10:15 AM, Dmitriy Lyubimov <[email protected]>
> wrote:
> > PS actually I think it should scale horizontally a little better than
> > vertically, but that's just a guess.
> >
> >> On Tue, Nov 29, 2011 at 10:10 AM, Dmitriy Lyubimov <[email protected]> wrote:
> >>> On Tue, Nov 29, 2011 at 9:56 AM, Nathan Halko <[email protected]> wrote:
> >>>
> >>> The docs look great Dmitriy.  Has anyone considered giving oversampling
> >>> guidance?  I'm trying ssvd over lanczos, which is promising.  Trying to
> >>> scale out horizontally but not seeing any difference between using one
> >>> slave or many slaves.  Any
> >>> ideas? (I won't go into detail about the setup here but if it sounds
> >>> familiar I'd like to talk more.)
> >>
> >> What do you mean by a slave? a mapper? a machine?
> >>
> >> Whether you increase input horizontally or vertically, you should see
> >> more mappers. If your cluster has enough capacity to schedule all
> >> mappers right away, I believe you will get almost the same time (i.e.
> >> almost linear scaling) for most of the jobs.
> >>
> >>> The basic problem with lanczos in the distributed environment seems to
> >>> be that a matrix-vector multiply is not enough work to offset any setup
> >>> costs; also, there is no distributed orthogonalization with lanczos, and
> >>> I'm getting OOMs, which makes it difficult to scale.  I would still like
> >>> to contribute what results I have found, but I'm short on time, so
> >>> nothing besides work directly related to the completion of my thesis
> >>> will happen until that is done.
> >>>
> >>
> >>> On Fri, Nov 25, 2011 at 5:37 PM, Dmitriy Lyubimov <[email protected]> wrote:
> >>>
> >>> > I attached the latex source as well (lyx, actually). I would've used
> >>> > Wiki if it supported mathjax. So anyone can modify the usage if need
> >>> > be. (Anyone who has lyx anyway).
> >>> >
> >>> > Dev docs were attached to several jira issues (and I had blog
> >>> > entries); if you want more recent copies of them moved over to the
> >>> > wiki, I'd be happy to do that. Mainly, so far there are 2 working
> >>> > notes, one for the original method and another for power iterations,
> >>> > attached to the corresponding jiras.
> >>> >
> >>> >
> >>> > On Fri, Nov 25, 2011 at 4:26 PM, Grant Ingersoll <[email protected]> wrote:
> >>> > > I hooked it into the Algorithms page.
> >>> > >
> >>> > > How do you intend to keep the PDF up to date?  I like the focus more
> >>> > > on the user, but it would also be good to have some dev docs.
> >>> > >
> >>> > > Also, with both Lanczos and this, it would be good if we could hook
> >>> > > them into some real examples.
> >>> > >
> >>> > > On Nov 25, 2011, at 5:42 PM, Dmitriy Lyubimov wrote:
> >>> > >
> >>> > >> Hi,
> >>> > >>
> >>> > >> I put a usage and overview doc for SSVD onto the wiki. I'd
> >>> > >> appreciate it if somebody else could look thru it, to scan for
> >>> > >> completeness and suggestions.
> >>> > >>
> >>> > >> I tried to approach it as user-facing documentation, i.e. I tried
> >>> > >> to avoid discussing any implementation specifics.
> >>> > >>
> >>> > >> I had several users and Nathan Halko trying it out and actually
> >>> > >> favorably commenting on its scalability vs. Lanczos, but I don't
> >>> > >> know first hand of any production use (even our own use is fairly
> >>> > >> limited in terms of input volume we ever processed, and has actually
> >>> > >> somewhat diverged from this Mahout implementation). Perhaps putting
> >>> > >> it more in front of users will help to receive more feedback.
> >>> > >>
> >>> > >> Thanks.
> >>> > >> -Dmitriy
> >>> > >
> >>> > > --------------------------------------------
> >>> > > Grant Ingersoll
> >>> > > http://www.lucidimagination.com
