On reading what I sent here, I think maybe some amplification is in order.
The trouble with problems that are shaped like matrix multiplication is that matrix-matrix multiplication requires O(n^3) operations on O(n^2) data elements. If coded naively, you need O(n^3) data loads and stores, which is very bad since on modern architectures arithmetic operations are much faster than memory operations. With Hadoop, and map-reduce in general, this problem is much, much worse.

The answer is to hold a block of data in a fast level of storage and to perform multiple operations on each data element before letting it go. For conventional main-memory architectures, it is common to use one or two levels of blocking: one so that registers get re-used and another so that L1 cache gets re-used effectively. Doing this well can mean a 10x or even 100x improvement in performance. Similar, though not quite so impressive, gains are possible with matrix-vector products.

This issue is reflected ubiquitously in high-quality numerical linear algebra codes. The emphasis is always on avoiding element-by-element access, replacing it with what are called level 3 BLAS operations, which are basically in-place updates using matrix-matrix multiplication as the primitive operation. The importance of in-place updates and level 3 decomposition is largely what drove Colt to support views and slices so carefully.

With map-reduce, these problems are much more serious since we now have a data channel that is slower yet again. The answer is to structure the code so that large sub-matrices move around as single data elements and are operated on with high-level operations. In my own graph-based algorithms, I use something shaped like a matrix-transpose-times-matrix multiplication, and I basically join columns of the left product to copies of the right product. This results in n times more data traffic than I would like, but it lets the mappers take on serious chunks of work and gives me pretty decent throughput.
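To make the blocking idea concrete, here is a rough sketch in plain Java of a single level of blocking for C += A * B. The class name, and the block size of 64, are just placeholders; real code tunes the block size to the cache and usually adds a second, register-level blocking inside. This is only meant to show the shape of the loop nest, not to compete with a tuned BLAS.

  // Rough sketch of one level of cache blocking for C += A * B on
  // n x n, row-major arrays.  BLOCK = 64 is a placeholder value.
  public class BlockedMultiply {
    private static final int BLOCK = 64;

    public static void multiplyAdd(double[][] a, double[][] b,
                                   double[][] c, int n) {
      for (int ii = 0; ii < n; ii += BLOCK) {
        for (int kk = 0; kk < n; kk += BLOCK) {
          for (int jj = 0; jj < n; jj += BLOCK) {
            int iMax = Math.min(ii + BLOCK, n);
            int kMax = Math.min(kk + BLOCK, n);
            int jMax = Math.min(jj + BLOCK, n);
            // many arithmetic operations per element while the three
            // blocks are (hopefully) resident in cache
            for (int i = ii; i < iMax; i++) {
              for (int k = kk; k < kMax; k++) {
                double aik = a[i][k];
                for (int j = jj; j < jMax; j++) {
                  c[i][j] += aik * b[k][j];
                }
              }
            }
          }
        }
      }
    }
  }

A level 3 BLAS update does essentially this loop nest for you, which is why the usual advice is to avoid element-by-element access and express the work as block updates instead.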
On 2/1/08 3:48 PM, "Ted Dunning" <[EMAIL PROTECTED]> wrote:

> In my work, I am finding that sending around entire rows or columns of the
> adjacency graph gives substantial gains as does block decomposition of some
> of the algorithms involved.
>
>
> On 2/1/08 2:51 PM, "Joydeep Sen Sarma" <[EMAIL PROTECTED]> wrote:
>
>> some of our biggest map reduce jobs have been graph related (not shortest
>> path though).
>>
>> map-reduce doesn't seem like the best computation platform for some of the
>> jobs we have had to do. Even a huge graph does not require unheard amounts
>> of memory to store as an adjacency list. but mapping (at least some) graph
>> algorithms to map-reduces causes intermediate data to bloat to enormous
>> sizes.
>>
>> to that end we are moving away from pure map-reduce to hybrid models that
>> work in tandem with large memory banks caching the graph. The trend towards
>> cheap flash storage is a helpful factor (one that we haven't exploited yet
>> though).
>>
>>
>>
>> -----Original Message-----
>> From: Peter W. [mailto:[EMAIL PROTECTED]]
>> Sent: Fri 2/1/2008 1:14 PM
>> To: [EMAIL PROTECTED]
>> Subject: Re: graph data representation for mapreduce
>>
>> Cam,
>>
>> Making a directed graph in Hadoop is not
>> very difficult but traversing live might be
>> since the result is a separate file.
>>
>> Basically, you kick out a destination node
>> as your key in the mapper and from nodes as
>> intermediate values. Concatenate from values in
>> the reducer assigning weights for each edge.
>>
>> Assigned edge scores come from a computation
>> done in the reducer or number passed by key.
>>
>> This gives a simple but weighted from/to
>> depiction and can be experimented with and
>> improved by subsequent passes or REST style
>> calls in the mapper for mysqldb weights.
>>
>> Later,
>>
>> Peter W.
>>
>> Cam Bazz wrote:
>>
>>> Hello,
>>>
>>> I have been long interested in storing graphs, in databases, object
>>> databases and lucene like indexes.
>>> ....
>>>
>>> Has anyone done any work on storing and processing graphs with map
>>> reduce? If I were to start, where would I start from. I am interested
>>> in finding shortest paths in a large graph.
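To make the kind of job Peter describes above concrete, here is a rough sketch against the current mapred API: the mapper kicks out the destination node as the key with the from-node as the intermediate value, and the reducer concatenates the from-nodes and assigns each edge a weight. The input layout (one edge per line as "from<TAB>to"), the weighting rule (1/in-degree), and the class names are just assumptions for the sake of the example.

  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.Iterator;
  import java.util.List;

  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reducer;
  import org.apache.hadoop.mapred.Reporter;

  // Sketch only: assumes one edge per line as "from<TAB>to" and weights
  // each incoming edge of a node by 1/in-degree.  Both are placeholders.
  public class WeightedEdges {

    public static class EdgeMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {
      public void map(LongWritable key, Text value,
                      OutputCollector<Text, Text> out, Reporter reporter)
          throws IOException {
        String[] parts = value.toString().split("\t");
        if (parts.length == 2) {
          // destination node becomes the key, from-node the value
          out.collect(new Text(parts[1]), new Text(parts[0]));
        }
      }
    }

    public static class EdgeReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {
      public void reduce(Text key, Iterator<Text> values,
                         OutputCollector<Text, Text> out, Reporter reporter)
          throws IOException {
        List<String> froms = new ArrayList<String>();
        while (values.hasNext()) {
          froms.add(values.next().toString());
        }
        double weight = 1.0 / froms.size();
        StringBuilder sb = new StringBuilder();
        for (String from : froms) {
          if (sb.length() > 0) {
            sb.append(',');
          }
          sb.append(from).append(':').append(weight);
        }
        // one output line per destination: from1:w,from2:w,...
        out.collect(key, new Text(sb.toString()));
      }
    }
  }

Subsequent passes could then re-score those edges or join against externally stored weights, along the lines Peter suggests.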
