In my opinion, the most important takeaway from MapReduce is that you don't move the data, you move the computation towards the data. So I implemented a persistent storage layer and moved most of the critical computations directly into the storage (I did play around with Hama before this).
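To make the "computation towards the data" point concrete, the shape of the API is roughly the following. This is only a toy illustration with invented names (RowKernel, MatrixStore), not my actual storage interface:

    // Toy sketch: the client ships a small piece of computation to the storage
    // node, which applies it to the rows it already holds locally, so only the
    // (small) result ever crosses the network instead of the rows themselves.
    interface RowKernel {
        void apply(int rowIndex, double[] row);
    }

    interface MatrixStore {
        void forEachLocalRow(String matrixName, RowKernel kernel);
    }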
Right now I don't have any problem keeping all 8 cores occupied. I think MapReduce helps reduce network I/O because the data is stored locally at each node. Even in matrix multiplication, where I need to replicate one of the matrices across all the nodes, the replication takes around 10 to 15 minutes and is followed by about 45 minutes of computation.

With 34 systems I get a combined network transfer rate of around 5.8 gigabits per second on a gigabit LAN; after all, the individual links in a switched network can transmit independently. I have noticed that aggregate network throughput scales faster than linearly if I am lucky enough to schedule the transfers properly (I use a simple random permutation on each node to decide the sequence of network transfers). Scheduling I/O carefully makes a big difference in a distributed system, but disk access is still a bottleneck. I currently have all matrices stored on 15k rpm SAS disks, and reading a single file from multiple threads makes a big difference on these disks (unlike on normal 7,200 rpm disks). I/O devices have become a lot faster, but exploiting that speed does make programming a lot harder.

I am using a fairly naive matrix multiplication algorithm: for A = B * C, each row of A is computed as a scaled sum of rows of C (row i of A accumulates B[i][k] times row k of C). I found this reasonably easy to parallelize, and the access patterns have good cache locality.
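The per-row kernel is essentially the following. This is a simplified, single-node, single-threaded sketch of what I described above; the real code works on blocks of the distributed store and splits the i loop across threads:

    // A = B * C, with A pre-zeroed. Row i of A is accumulated as a scaled sum
    // of the rows of C, i.e. the i-k-j loop ordering: we stream through one row
    // of B, one row of C and one row of A at a time, which is cache friendly
    // and lets each thread own a disjoint range of i without any locking.
    final class RowMultiply {
        static void multiplyAdd(double[][] a, double[][] b, double[][] c) {
            final int n = b.length;        // rows of B (and A)
            final int p = c.length;        // columns of B == rows of C
            final int m = c[0].length;     // columns of C (and A)
            for (int i = 0; i < n; i++) {
                final double[] ai = a[i];
                final double[] bi = b[i];
                for (int k = 0; k < p; k++) {
                    final double scale = bi[k];
                    final double[] ck = c[k];
                    for (int j = 0; j < m; j++) {
                        ai[j] += scale * ck[j];   // row i of A += B[i][k] * row k of C
                    }
                }
            }
        }
    }

And the transfer scheduling I mentioned is nothing fancier than each node shuffling the list of peers before it starts pulling the replicated matrix, roughly like this (the peer list and the fetch call are placeholders, not my actual transfer code):

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    final class TransferOrder {
        // Each node independently applies a random permutation to the peer list
        // so that simultaneous transfers are spread across the switch instead of
        // every node hitting the same source link at the same time.
        static void pullFromPeers(List<String> peers) {
            List<String> order = new ArrayList<>(peers);
            Collections.shuffle(order);              // the "simple random permutation"
            for (String peer : order) {
                fetchMatrixPartFrom(peer);           // placeholder for the real transfer
            }
        }

        private static void fetchMatrixPartFrom(String peer) { /* hypothetical */ }
    }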
On Mon, Apr 12, 2010 at 2:23 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:

> Without outrageous magic FFT algorithms, nxn times nxn matrix multiply
> requires k_1 n^3 flops and at least k_2 n^2 I/O operations. If you can
> arrange to re-use every operand a LOT, then you only have to do 2 n^2 IOPS,
> but if you fail in that, then I/O tends toward n^3 as well. If you do
> manage to get sufficient re-use, then the fact that FLOPS are faster than
> IOPS lets us be CPU-bound. In a single machine, IOPS are pretty fast so
> you can saturate the CPU more easily than some other case.
>
> For map-reduce on multiple machines, the IOPS/FLOPS ratio becomes evil and
> keeping the CPU loaded becomes much harder. In general, it is hard to
> transfer even 50MBytes/s into a single mapper, which is only 5M floating
> point numbers per second. At a perfect 40K FLOPS per number, this is only
> 200MFlops per CPU (with several cores!!). Well tuned dense multiply
> routines should crank out about 2-3GFlops per core. Thus, for an 8 core
> machine, we have a big deficiency to overcome before Hadoop-style
> map-reduce is practical for this.
>
> For lots of algorithms, this gets even worse. The convergence rate of
> on-line gradient suffers badly with any kind of delay between computation
> of the gradient and its application to the model. Batching is a kind of
> delay. Moreover, a single CPU is often entirely sufficient to saturate I/O
> devices. This means that parallelizing the algorithm gives almost no wall
> clock convergence speedup at all even though all CPUs might be very, very
> busy.
>
> On Sun, Apr 11, 2010 at 9:27 PM, Steven Buss <steven.b...@gmail.com> wrote:
>
>> If you're just doing matrix multiplication, I would advise that Mahout
>> (or any mapreduce approach) isn't well suited to your problem. I did
>> the same computation with Matlab (multiplying two 40k x 40k random
>> double precision dense matrices) using 14 cores and about 36GB of RAM
>> on a single machine* and it finished in about 55 minutes. If I'm
>> reading your email correctly, you were working with 34*2*4=272 cores!
>> I'm not sure if dense matrix multiplication can actually be
>> efficiently mapreduced, but I am still a rookie so don't take my word
>> for it.
>>
>> *The machine I am working on has 8 dual-core AMD Opteron 875s @ 2.2GHz
>> per core, with 64GB total system memory.
>>
>> Steven Buss
>> steven.b...@gmail.com
>> http://www.stevenbuss.com/
>>
>> On Sun, Apr 11, 2010 at 11:53 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>> > Vimal,
>> >
>> > We don't have any distributed dense multiplication operations because we
>> > have not yet found much application demand for distributed dense matrix
>> > multiplication. Distributed sparse matrix operations are a big deal,
>> > however.
>> >
>> > If you are interested in working on the problem in the context of Mahout,
>> > we would love to help. This is especially true if you have an application
>> > that needs dense operations and could benefit from some of the other
>> > capabilities in Mahout.
>> >
>> > On Sun, Apr 11, 2010 at 1:27 PM, Vimal Mathew <vml.mat...@gmail.com> wrote:
>> >
>> >> Hi,
>> >>  What's the current state of matrix-matrix multiplication in Mahout?
>> >> Are there any performance results available for large matrices?
>> >>
>> >> I have been working on a Hadoop-compatible distributed storage for
>> >> matrices. I can currently multiply two 40K x 40K dense double
>> >> precision matrices in around 1 hour using 34 systems (16GB RAM, two
>> >> Core2Quads per node). I was wondering how this compares with Mahout.
>> >>
>> >> Regards,
>> >>  Vimal