In my opinion, the most important takeaway from MapReduce is that you don't move the data, you move the computation towards the data. So I implemented a persistent storage layer and moved most of the critical computations directly into the storage (I did play around with Hama before this).
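To make the "computation towards the data" point concrete, the shape of the API is roughly the following. This is only a toy illustration with invented names (RowKernel, MatrixStore), not my actual storage interface:

    // Toy sketch: the client ships a small piece of computation to the storage
    // node, which applies it to the rows it already holds locally, so only the
    // (small) result ever crosses the network instead of the rows themselves.
    interface RowKernel {
        void apply(int rowIndex, double[] row);
    }

    interface MatrixStore {
        void forEachLocalRow(String matrixName, RowKernel kernel);
    }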
Right now I don't have any problem keeping all 8 cores occupied. I think MapReduce helps reduce network I/O because the data is stored locally at each node. Even in matrix multiplication, where I need to replicate one of the matrices across all the nodes, the replication takes around 10 to 15 minutes and is followed by about 45 minutes of computation.

With 34 systems I get a combined network transfer rate of around 5.8 gigabits per second on a gigabit LAN; after all, the individual links in a switched network can transmit independently. I have noticed that aggregate network throughput scales faster than linearly if I am lucky enough to schedule the transfers properly (I use a simple random permutation on each node to decide the sequence of network transfers). Scheduling I/O carefully makes a big difference in a distributed system, but disk access is still a bottleneck. I currently have all matrices stored on 15k rpm SAS disks, and reading a single file from multiple threads makes a big difference on these disks (unlike on normal 7,200 rpm disks). I/O devices have become a lot faster, but exploiting that speed does make programming a lot harder.

I am using a fairly naive matrix multiplication algorithm: for A = B * C, each row of A is computed as a scaled sum of rows of C (row i of A accumulates B[i][k] times row k of C). I found this reasonably easy to parallelize, and the access patterns have good cache locality.
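The per-row kernel is essentially the following. This is a simplified, single-node, single-threaded sketch of what I described above; the real code works on blocks of the distributed store and splits the i loop across threads:

    // A = B * C, with A pre-zeroed. Row i of A is accumulated as a scaled sum
    // of the rows of C, i.e. the i-k-j loop ordering: we stream through one row
    // of B, one row of C and one row of A at a time, which is cache friendly
    // and lets each thread own a disjoint range of i without any locking.
    final class RowMultiply {
        static void multiplyAdd(double[][] a, double[][] b, double[][] c) {
            final int n = b.length;        // rows of B (and A)
            final int p = c.length;        // columns of B == rows of C
            final int m = c[0].length;     // columns of C (and A)
            for (int i = 0; i < n; i++) {
                final double[] ai = a[i];
                final double[] bi = b[i];
                for (int k = 0; k < p; k++) {
                    final double scale = bi[k];
                    final double[] ck = c[k];
                    for (int j = 0; j < m; j++) {
                        ai[j] += scale * ck[j];   // row i of A += B[i][k] * row k of C
                    }
                }
            }
        }
    }

And the transfer scheduling I mentioned is nothing fancier than each node shuffling the list of peers before it starts pulling the replicated matrix, roughly like this (the peer list and the fetch call are placeholders, not my actual transfer code):

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    final class TransferOrder {
        // Each node independently applies a random permutation to the peer list
        // so that simultaneous transfers are spread across the switch instead of
        // every node hitting the same source link at the same time.
        static void pullFromPeers(List<String> peers) {
            List<String> order = new ArrayList<>(peers);
            Collections.shuffle(order);              // the "simple random permutation"
            for (String peer : order) {
                fetchMatrixPartFrom(peer);           // placeholder for the real transfer
            }
        }

        private static void fetchMatrixPartFrom(String peer) { /* hypothetical */ }
    }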
On Mon, Apr 12, 2010 at 2:23 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:

> Without outrageous magic FFT algorithms, nxn times nxn matrix multiply
> requires k_1 n^3 flops and at least k_2 n^2 I/O operations. If you can
> arrange to re-use every operand a LOT, then you only have to do 2 n^2 IOPS,
> but if you fail in that, then I/O tends toward n^3 as well. If you do
> manage to get sufficient re-use, then the fact that FLOPS are faster than
> IOPS lets us be CPU-bound. In a single machine, IOPS are pretty fast so
> you can saturate the CPU more easily than some other case.
>
> For map-reduce on multiple machines, the IOPS/FLOPS ratio becomes evil and
> keeping the CPU loaded becomes much harder. In general, it is hard to
> transfer even 50MBytes/s into a single mapper, which is only 5M floating
> point numbers per second. At a perfect 40K FLOPS per number, this is only
> 200MFlops per CPU (with several cores!!). Well tuned dense multiply
> routines should crank out about 2-3GFlops per core. Thus, for an 8 core
> machine, we have a big deficiency to overcome before Hadoop-style
> map-reduce is practical for this.
>
> For lots of algorithms, this gets even worse. The convergence rate of
> on-line gradient suffers badly with any kind of delay between computation
> of the gradient and its application to the model. Batching is a kind of
> delay. Moreover, a single CPU is often entirely sufficient to saturate I/O
> devices. This means that parallelizing the algorithm gives almost no wall
> clock convergence speedup at all even though all CPUs might be very, very
> busy.
>
> On Sun, Apr 11, 2010 at 9:27 PM, Steven Buss <steven.b...@gmail.com> wrote:
>
>> If you're just doing matrix multiplication, I would advise that Mahout
>> (or any mapreduce approach) isn't well suited to your problem. I did
>> the same computation with Matlab (multiplying two 40k x 40k random
>> double precision dense matrices) using 14 cores and about 36GB of RAM
>> on a single machine* and it finished in about 55 minutes. If I'm
>> reading your email correctly, you were working with 34*2*4=272 cores!
>> I'm not sure if dense matrix multiplication can actually be
>> efficiently mapreduced, but I am still a rookie so don't take my word
>> for it.
>>
>> *The machine I am working on has 8 dual-core AMD Opteron 875s @ 2.2GHz
>> per core, with 64GB total system memory.
>>
>> Steven Buss
>> steven.b...@gmail.com
>> http://www.stevenbuss.com/
>>
>> On Sun, Apr 11, 2010 at 11:53 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>> > Vimal,
>> >
>> > We don't have any distributed dense multiplication operations because we
>> > have not yet found much application demand for distributed dense matrix
>> > multiplication. Distributed sparse matrix operations are a big deal,
>> > however.
>> >
>> > If you are interested in working on the problem in the context of Mahout,
>> > we would love to help. This is especially true if you have an application
>> > that needs dense operations and could benefit from some of the other
>> > capabilities in Mahout.
>> >
>> > On Sun, Apr 11, 2010 at 1:27 PM, Vimal Mathew <vml.mat...@gmail.com> wrote:
>> >
>> >> Hi,
>> >>  What's the current state of matrix-matrix multiplication in Mahout?
>> >> Are there any performance results available for large matrices?
>> >>
>> >> I have been working on a Hadoop-compatible distributed storage for
>> >> matrices. I can currently multiply two 40K x 40K dense double
>> >> precision matrices in around 1 hour using 34 systems (16GB RAM, two
>> >> Core2Quads per node). I was wondering how this compares with Mahout.
>> >>
>> >> Regards,
>> >>  Vimal