On reading what I sent here, I think maybe some amplification is in order.
The trouble with problems that are shaped like matrix multiplication is that matrix-matrix multiplication requires O(n^3) operations on O(n^2) data elements. If coded naively, you need O(n^3) data loads and stores, which is very bad since on modern architectures arithmetic operations are much faster than memory operations. With Hadoop, and map-reduce in general, this problem is much, much worse.

The answer is to hold a block of data in a fast level of storage and to perform multiple operations on each data element before letting it go. For conventional main-memory architectures, it is common to use one or two levels of blocking: one so that registers get re-used and another so that L1 cache gets re-used effectively. Doing this well can mean a 10x or even 100x improvement in performance. Similar, though not quite so impressive, gains are possible with matrix-vector products.

This issue is reflected ubiquitously in high-quality numerical linear algebra codes. The emphasis is always on avoiding element-by-element access, replacing it with what are called level 3 BLAS operations, which are basically in-place updates using matrix-matrix multiplication as the primitive operation. The importance of in-place updates and level 3 decomposition is largely what drove Colt to support views and slices so carefully.

With map-reduce, these problems are much more serious since we now have a data channel that is slower yet again. The answer is to structure the code so that large sub-matrices move around as single data elements and are operated on with high-level operations. In my own graph-based algorithms, I use something shaped like a matrix-transpose-times-matrix multiplication, and I basically join columns of the left product to copies of the right product. This results in n times more data traffic than I would like, but it lets the mappers take on serious chunks of work and gives me pretty decent throughput.
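To make the blocking idea concrete, here is a rough sketch in plain Java of a single level of blocking for C += A * B. The class name, and the block size of 64, are just placeholders; real code tunes the block size to the cache and usually adds a second, register-level blocking inside. This is only meant to show the shape of the loop nest, not to compete with a tuned BLAS.

  // Rough sketch of one level of cache blocking for C += A * B on
  // n x n, row-major arrays.  BLOCK = 64 is a placeholder value.
  public class BlockedMultiply {
    private static final int BLOCK = 64;

    public static void multiplyAdd(double[][] a, double[][] b,
                                   double[][] c, int n) {
      for (int ii = 0; ii < n; ii += BLOCK) {
        for (int kk = 0; kk < n; kk += BLOCK) {
          for (int jj = 0; jj < n; jj += BLOCK) {
            int iMax = Math.min(ii + BLOCK, n);
            int kMax = Math.min(kk + BLOCK, n);
            int jMax = Math.min(jj + BLOCK, n);
            // many arithmetic operations per element while the three
            // blocks are (hopefully) resident in cache
            for (int i = ii; i < iMax; i++) {
              for (int k = kk; k < kMax; k++) {
                double aik = a[i][k];
                for (int j = jj; j < jMax; j++) {
                  c[i][j] += aik * b[k][j];
                }
              }
            }
          }
        }
      }
    }
  }

A level 3 BLAS update does essentially this loop nest for you, which is why the usual advice is to avoid element-by-element access and express the work as block updates instead.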
On 2/1/08 3:48 PM, "Ted Dunning" <[EMAIL PROTECTED]> wrote:

> In my work, I am finding that sending around entire rows or columns of the
> adjacency graph gives substantial gains as does block decomposition of some
> of the algorithms involved.
>
>
> On 2/1/08 2:51 PM, "Joydeep Sen Sarma" <[EMAIL PROTECTED]> wrote:
>
>> some of our biggest map reduce jobs have been graph related (not shortest
>> path though).
>>
>> map-reduce doesn't seem like the best computation platform for some of the
>> jobs we have had to do. Even a huge graph does not require unheard amounts
>> of memory to store as an adjacency list. but mapping (at least some) graph
>> algorithms to map-reduces causes intermediate data to bloat to enormous
>> sizes.
>>
>> to that end we are moving away from pure map-reduce to hybrid models that
>> work in tandem with large memory banks caching the graph. The trend towards
>> cheap flash storage is a helpful factor (one that we haven't exploited yet
>> though).
>>
>>
>>
>> -----Original Message-----
>> From: Peter W. [mailto:[EMAIL PROTECTED]]
>> Sent: Fri 2/1/2008 1:14 PM
>> To: [EMAIL PROTECTED]
>> Subject: Re: graph data representation for mapreduce
>>
>> Cam,
>>
>> Making a directed graph in Hadoop is not
>> very difficult but traversing live might be
>> since the result is a separate file.
>>
>> Basically, you kick out a destination node
>> as your key in the mapper and from nodes as
>> intermediate values. Concatenate from values in
>> the reducer assigning weights for each edge.
>>
>> Assigned edge scores come from a computation
>> done in the reducer or number passed by key.
>>
>> This gives a simple but weighted from/to
>> depiction and can be experimented with and
>> improved by subsequent passes or REST style
>> calls in the mapper for mysqldb weights.
>>
>> Later,
>>
>> Peter W.
>>
>> Cam Bazz wrote:
>>
>>> Hello,
>>>
>>> I have been long interested in storing graphs, in databases, object
>>> databases and lucene like indexes.
>>> ....
>>>
>>> Has anyone done any work on storing and processing graphs with map
>>> reduce? If I were to start, where would I start from. I am interested
>>> in finding shortest paths in a large graph.
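To make the kind of job Peter describes above concrete, here is a rough sketch against the current mapred API: the mapper kicks out the destination node as the key with the from-node as the intermediate value, and the reducer concatenates the from-nodes and assigns each edge a weight. The input layout (one edge per line as "from<TAB>to"), the weighting rule (1/in-degree), and the class names are just assumptions for the sake of the example.

  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.Iterator;
  import java.util.List;

  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reducer;
  import org.apache.hadoop.mapred.Reporter;

  // Sketch only: assumes one edge per line as "from<TAB>to" and weights
  // each incoming edge of a node by 1/in-degree.  Both are placeholders.
  public class WeightedEdges {

    public static class EdgeMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {
      public void map(LongWritable key, Text value,
                      OutputCollector<Text, Text> out, Reporter reporter)
          throws IOException {
        String[] parts = value.toString().split("\t");
        if (parts.length == 2) {
          // destination node becomes the key, from-node the value
          out.collect(new Text(parts[1]), new Text(parts[0]));
        }
      }
    }

    public static class EdgeReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {
      public void reduce(Text key, Iterator<Text> values,
                         OutputCollector<Text, Text> out, Reporter reporter)
          throws IOException {
        List<String> froms = new ArrayList<String>();
        while (values.hasNext()) {
          froms.add(values.next().toString());
        }
        double weight = 1.0 / froms.size();
        StringBuilder sb = new StringBuilder();
        for (String from : froms) {
          if (sb.length() > 0) {
            sb.append(',');
          }
          sb.append(from).append(':').append(weight);
        }
        // one output line per destination: from1:w,from2:w,...
        out.collect(key, new Text(sb.toString()));
      }
    }
  }

Subsequent passes could then re-score those edges or join against externally stored weights, along the lines Peter suggests.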
