pig-user  

Re: Large tuples

Ted Dunning
Wed, 11 Jun 2008 11:00:43 -0700

The more that I look at problems that include matrix multiplication as a
core element, the more that I think that large tuples are a bad way to
encode these matrices.

I am assuming very sparse matrices here which is definitely the norm in what
I do.  Very large dense matrices would probably require different
approaches.

To make what I suggest, here is an example.  Suppose that we have two sparse
matrices A and B and we want to A' A and A' B.  This is useful in
coocurrence analysis, for instance.

One possible encoding for a matrix is to have a variable sized tuple that
starts with a row number and contains a sequence of column number, value
pairs.  Another encoding which appears less efficient at first glance is to
have triples with row number, column number and value.

Matrix multiplication of the second form can be written (roughly) as the
result of co-grouping A with A on rows or A on columns with B on rows and
emitting the product of corresponding elements.  These output elements are
then grouped on index pairs and summed.  In the A'A case, you can probably
do with one less group statement.

This seems really inefficient compared to passing entire rows around because
the number of records is so much larger.  But it turns out that it is really
pretty fast.  A comparable problem was worked on by Chris Dyer at University
of Maryland who go almost perfect speedup relative to a hand-coded
uni-processor C program by using this kind of approach.

Most importantly, this approach completely side-steps the problem of large
tuples.

On Tue, Jun 10, 2008 at 8:42 PM, pi song <[EMAIL PROTECTED]> wrote:

> ... basically). I was thinking about matrix processing in Pig before but it
> seemed to be difficult due to this fact.
>
>