Without outrageously clever Strassen-style algorithms, an nxn times nxn matrix multiply requires k_1 n^3 flops and at least k_2 n^2 I/O operations. If you can arrange to re-use every operand a LOT, then you only have to do O(n^2) I/O operations, but if you fail at that, then I/O tends toward n^3 as well. If you do manage to get sufficient re-use, then the fact that flops are much faster than I/O operations lets you be CPU-bound. On a single machine, I/O operations are pretty fast, so you can saturate the CPU much more easily than in a distributed setting.
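A toy model of this flop-versus-I/O accounting (this is an illustrative sketch, not Mahout code; the block size b and the constant k_1 = 2 for one multiply-add per inner-loop step are my assumptions):

```python
# Toy model of flop vs. I/O counts for an n x n dense matrix multiply.
# Assumes the classical O(n^3) algorithm (no Strassen-style tricks) and,
# for the blocked case, a schedule holding three b x b tiles in fast memory.

def flops(n):
    # One multiply and one add per (i, j, k) triple, so k_1 = 2 here.
    return 2 * n ** 3

def io_ops_no_reuse(n):
    # No re-use at all: every operand is re-fetched for every use,
    # so I/O grows like the flop count, ~n^3.
    return 2 * n ** 3

def io_ops_blocked(n, b):
    # With b x b blocking, each tile fetched from slow memory feeds
    # ~2*b^3 flops, so traffic drops to ~2*n^3 / b words, bounded below
    # by the unavoidable ~3*n^2 to read both inputs and write the output.
    return max(2 * n ** 3 // b, 3 * n ** 2)

n = 40_000
print(f"flops:               {flops(n):.2e}")
print(f"I/O, no re-use:      {io_ops_no_reuse(n):.2e}")
print(f"I/O, b=1000 blocks:  {io_ops_blocked(n, 1000):.2e}")
```

The point of the blocked line is the re-use factor: each number moved supports ~b flops instead of ~1, which is exactly the difference between being I/O-bound and CPU-bound.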
For map-reduce on multiple machines, the I/O-to-flop ratio becomes evil and keeping the CPU loaded becomes much harder. In general, it is hard to transfer even 50 MBytes/s into a single mapper, which is only about 5 million double-precision numbers per second. Even at a re-use rate of 40 flops per number transferred, this is only 200 MFlops per mapper (on a machine with several cores!!). Well-tuned dense multiply routines should crank out about 2-3 GFlops per core. Thus, for an 8-core machine, we have a roughly hundred-fold deficit to overcome before Hadoop-style map-reduce is practical for this.

For lots of algorithms, this gets even worse. The convergence rate of on-line gradient descent suffers badly with any kind of delay between computation of the gradient and its application to the model, and batching is a kind of delay. Moreover, a single CPU is often entirely sufficient to saturate the I/O devices. This means that parallelizing the algorithm gives almost no wall-clock convergence speedup at all even though all the CPUs might look very, very busy.

On Sun, Apr 11, 2010 at 9:27 PM, Steven Buss <steven.b...@gmail.com> wrote:

> If you're just doing matrix multiplication, I would advise that mahout
> (or any mapreduce approach) isn't well suited to your problem. I did
> the same computation with matlab (multiplying two 40k x 40k random
> double precision dense matrices) using 14 cores and about 36GB of ram
> on a single machine* and it finished in about 55 minutes. If I'm
> reading your email correctly, you were working with 34*2*4=272 cores!
> I'm not sure if dense matrix multiplication can actually be
> efficiently mapreduced, but I am still a rookie so don't take my word
> for it.
>
> *The machine I am working on has 8 dual core AMD opteron 875s @ 2.2GHz
> per core, with 64GB total system memory.
>
> Steven Buss
> steven.b...@gmail.com
> http://www.stevenbuss.com/
>
> On Sun, Apr 11, 2010 at 11:53 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>
> > Vimal,
> >
> > We don't have any distributed dense multiplication operations because we
> > have not yet found much application demand for distributed dense matrix
> > multiplication. Distributed sparse matrix operations are a big deal,
> > however.
> >
> > If you are interested in working on the problem in the context of Mahout, we
> > would love to help. This is especially true if you have an application that
> > needs dense operations and could benefit from some of the other capabilities
> > in Mahout.
> >
> > On Sun, Apr 11, 2010 at 1:27 PM, Vimal Mathew <vml.mat...@gmail.com> wrote:
> >
> >> Hi,
> >>  What's the current state of matrix-matrix multiplication in Mahout?
> >> Are there any performance results available for large matrices?
> >>
> >> I have been working on a Hadoop-compatible distributed storage for
> >> matrices. I can currently multiply two 40K x 40K dense double
> >> precision matrices in around 1 hour using 34 systems (16GB RAM, two
> >> Core2Quads per node). I was wondering how this compares with Mahout.
> >>
> >> Regards,
> >>  Vimal
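The back-of-envelope bandwidth argument earlier in this message can be reproduced numerically. All constants here are the rough figures quoted above (50 MB/s per mapper rounded down to 5M doubles/s, 40 flops of re-use per number, 2.5 GFlops/core), not measurements:

```python
# Back-of-envelope check of the mapper-bandwidth argument.
# Every constant is a rough figure from the message, not a measurement.

doubles_per_sec = 5e6        # ~50 MB/s / 8 bytes, rounded down as in the message
flops_per_double = 40        # assumed re-use per number transferred
io_bound_rate = doubles_per_sec * flops_per_double   # flops/s a mapper can feed

peak_per_core = 2.5e9        # ~2-3 GFlops for well-tuned dense kernels
cores = 8
cpu_bound_rate = peak_per_core * cores               # what the box could do

print(f"I/O-bound rate: {io_bound_rate / 1e6:.0f} MFlops per mapper")
print(f"CPU-bound rate: {cpu_bound_rate / 1e9:.0f} GFlops per 8-core machine")
print(f"deficit factor: {cpu_bound_rate / io_bound_rate:.0f}x")
```

Under these assumptions the 8-core machine could do about 100x more arithmetic than the mapper's input stream can feed it, which is the deficit the message describes.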