Ted Dunning
Wed, 11 Jun 2008 11:00:43 -0700
The more that I look at problems that include matrix multiplication as a core element, the more that I think that large tuples are a bad way to encode these matrices. I am assuming very sparse matrices here which is definitely the norm in what I do. Very large dense matrices would probably require different approaches. To make what I suggest, here is an example. Suppose that we have two sparse matrices A and B and we want to A' A and A' B. This is useful in coocurrence analysis, for instance. One possible encoding for a matrix is to have a variable sized tuple that starts with a row number and contains a sequence of column number, value pairs. Another encoding which appears less efficient at first glance is to have triples with row number, column number and value. Matrix multiplication of the second form can be written (roughly) as the result of co-grouping A with A on rows or A on columns with B on rows and emitting the product of corresponding elements. These output elements are then grouped on index pairs and summed. In the A'A case, you can probably do with one less group statement. This seems really inefficient compared to passing entire rows around because the number of records is so much larger. But it turns out that it is really pretty fast. A comparable problem was worked on by Chris Dyer at University of Maryland who go almost perfect speedup relative to a hand-coded uni-processor C program by using this kind of approach. Most importantly, this approach completely side-steps the problem of large tuples. On Tue, Jun 10, 2008 at 8:42 PM, pi song <[EMAIL PROTECTED]> wrote: > ... basically). I was thinking about matrix processing in Pig before but it > seemed to be difficult due to this fact. > >