Without outrageously clever Strassen-style algorithms, an nxn times nxn matrix multiply requires k_1 n^3 flops and at least k_2 n^2 I/O operations. If you can arrange to re-use every operand a LOT, then you only have to do O(n^2) I/O operations, but if you fail at that, then I/O tends toward n^3 as well. If you do manage to get sufficient re-use, then the fact that flops are much faster than I/O operations lets you be CPU-bound. On a single machine, I/O operations are pretty fast, so you can saturate the CPU much more easily than in a distributed setting.
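A toy model of this flop-versus-I/O accounting (this is an illustrative sketch, not Mahout code; the block size b and the constant k_1 = 2 for one multiply-add per inner-loop step are my assumptions):

```python
# Toy model of flop vs. I/O counts for an n x n dense matrix multiply.
# Assumes the classical O(n^3) algorithm (no Strassen-style tricks) and,
# for the blocked case, a schedule holding three b x b tiles in fast memory.

def flops(n):
    # One multiply and one add per (i, j, k) triple, so k_1 = 2 here.
    return 2 * n ** 3

def io_ops_no_reuse(n):
    # No re-use at all: every operand is re-fetched for every use,
    # so I/O grows like the flop count, ~n^3.
    return 2 * n ** 3

def io_ops_blocked(n, b):
    # With b x b blocking, each tile fetched from slow memory feeds
    # ~2*b^3 flops, so traffic drops to ~2*n^3 / b words, bounded below
    # by the unavoidable ~3*n^2 to read both inputs and write the output.
    return max(2 * n ** 3 // b, 3 * n ** 2)

n = 40_000
print(f"flops:               {flops(n):.2e}")
print(f"I/O, no re-use:      {io_ops_no_reuse(n):.2e}")
print(f"I/O, b=1000 blocks:  {io_ops_blocked(n, 1000):.2e}")
```

The point of the blocked line is the re-use factor: each number moved supports ~b flops instead of ~1, which is exactly the difference between being I/O-bound and CPU-bound.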
For map-reduce on multiple machines, the I/O-to-flop ratio becomes evil and keeping the CPU loaded becomes much harder. In general, it is hard to transfer even 50 MBytes/s into a single mapper, which is only about 5 million double-precision numbers per second. Even at a re-use rate of 40 flops per number transferred, this is only 200 MFlops per mapper (on a machine with several cores!!). Well-tuned dense multiply routines should crank out about 2-3 GFlops per core. Thus, for an 8-core machine, we have a roughly hundred-fold deficit to overcome before Hadoop-style map-reduce is practical for this.

For lots of algorithms, this gets even worse. The convergence rate of on-line gradient descent suffers badly with any kind of delay between computation of the gradient and its application to the model, and batching is a kind of delay. Moreover, a single CPU is often entirely sufficient to saturate the I/O devices. This means that parallelizing the algorithm gives almost no wall-clock convergence speedup at all even though all the CPUs might look very, very busy.

On Sun, Apr 11, 2010 at 9:27 PM, Steven Buss <steven.b...@gmail.com> wrote:

> If you're just doing matrix multiplication, I would advise that mahout
> (or any mapreduce approach) isn't well suited to your problem. I did
> the same computation with matlab (multiplying two 40k x 40k random
> double precision dense matrices) using 14 cores and about 36GB of ram
> on a single machine* and it finished in about 55 minutes. If I'm
> reading your email correctly, you were working with 34*2*4=272 cores!
> I'm not sure if dense matrix multiplication can actually be
> efficiently mapreduced, but I am still a rookie so don't take my word
> for it.
>
> *The machine I am working on has 8 dual core AMD opteron 875s @ 2.2GHz
> per core, with 64GB total system memory.
>
> Steven Buss
> steven.b...@gmail.com
> http://www.stevenbuss.com/
>
> On Sun, Apr 11, 2010 at 11:53 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>
> > Vimal,
> >
> > We don't have any distributed dense multiplication operations because we
> > have not yet found much application demand for distributed dense matrix
> > multiplication. Distributed sparse matrix operations are a big deal,
> > however.
> >
> > If you are interested in working on the problem in the context of Mahout, we
> > would love to help. This is especially true if you have an application that
> > needs dense operations and could benefit from some of the other capabilities
> > in Mahout.
> >
> > On Sun, Apr 11, 2010 at 1:27 PM, Vimal Mathew <vml.mat...@gmail.com> wrote:
> >
> >> Hi,
> >>  What's the current state of matrix-matrix multiplication in Mahout?
> >> Are there any performance results available for large matrices?
> >>
> >> I have been working on a Hadoop-compatible distributed storage for
> >> matrices. I can currently multiply two 40K x 40K dense double
> >> precision matrices in around 1 hour using 34 systems (16GB RAM, two
> >> Core2Quads per node). I was wondering how this compares with Mahout.
> >>
> >> Regards,
> >>  Vimal
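The back-of-envelope bandwidth argument earlier in this message can be reproduced numerically. All constants here are the rough figures quoted above (50 MB/s per mapper rounded down to 5M doubles/s, 40 flops of re-use per number, 2.5 GFlops/core), not measurements:

```python
# Back-of-envelope check of the mapper-bandwidth argument.
# Every constant is a rough figure from the message, not a measurement.

doubles_per_sec = 5e6        # ~50 MB/s / 8 bytes, rounded down as in the message
flops_per_double = 40        # assumed re-use per number transferred
io_bound_rate = doubles_per_sec * flops_per_double   # flops/s a mapper can feed

peak_per_core = 2.5e9        # ~2-3 GFlops for well-tuned dense kernels
cores = 8
cpu_bound_rate = peak_per_core * cores               # what the box could do

print(f"I/O-bound rate: {io_bound_rate / 1e6:.0f} MFlops per mapper")
print(f"CPU-bound rate: {cpu_bound_rate / 1e9:.0f} GFlops per 8-core machine")
print(f"deficit factor: {cpu_bound_rate / io_bound_rate:.0f}x")
```

Under these assumptions the 8-core machine could do about 100x more arithmetic than the mapper's input stream can feed it, which is the deficit the message describes.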