Hi Ted,

Apologies I couldn't join the discussion earlier. Here's how we use the matrix APIs in Carrot2:

1. We use only 2D matrices and vectors.

2. We mainly use dense matrices (to implement decompositions); we also use the sparse ones occasionally.

3. We use the SVD, though it's not the default in our algorithms, so we could live without it for some time, I suppose. If the SVD is used, we depend on the order of singular vectors, but we can always sort the decomposition result afterwards based on the singular values.

4. The reason I chose Colt's API over the other ones was that it enabled writing code that minimizes the allocation of temporary matrices and the copying of data. A few examples:

a) The zMult() method can optionally store the result of the multiplication in one of its arguments instead of allocating a clean new matrix each time. This is especially useful in our iterative matrix decomposition routines, such as this one:

https://fisheye3.atlassian.com/browse/carrot2/trunk/core/carrot2-util-matrix/src/org/carrot2/matrix/factorization/NonnegativeMatrixFactorizationED.java?hb=true#to37

Plus, you can tell zMult() to transpose its arguments and multiply by a constant during the multiplication, which is also handy, but not crucial.

b) The viewDice(), viewColumn(), viewRow() and possibly some other views don't involve any copying of matrix data; they simply modify the way the raw data is accessed (see e.g. AbstractMatrix2D.vDice()), potentially saving time and garbage collection.

c) The data representation and the zMult() method are similar to the GEMM functions from BLAS implementations, which made it fairly easy to implement a dense matrix backed by native, platform-specific multiplication routines. Even with Colt's optimized dense matrix multiplication, the native versions were way faster (up to 10x when backed by ATLAS).

(I've pasted two rough code sketches illustrating a) and b) at the very bottom of this message, below the quoted text.)

We also use some less significant bits of the Colt API, like sorting, but these are not crucial and can be implemented "by hand" where needed.

I'm not sure to what extent the performance and memory footprint would change if we switched to Mahout matrices in their current shape, though. Later next week I can reimplement one of our factorizations using Mahout matrices and run some benchmarks. Based on that, we can determine where the bottlenecks are.

S.

On Thu, Feb 3, 2011 at 22:52, Ted Dunning <[email protected]> wrote:
> Actually, I haven't liked the double[][] implementation ever.
>
> Its great virtue is that it exists.
>
>
> On Thu, Feb 3, 2011 at 1:42 PM, Dawid Weiss <[email protected]> wrote:
>
>> >> private double[][] values;
>> > No, but that isn't all that hard to fix.
>>
>> Oh, I just thought you guys wanted it to be this way... But if not,
>> then a double[] is probably a better choice anyway (think potential
>> matrix data fragmentation if not anything else).
>>
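P.S. To make point 4a a bit more concrete, here's a rough, untested sketch (not our actual Carrot2 code; the dimensions, variable names and iteration count are made up) of how zMult()'s extended form lets an iterative update, here a simplified NMF step, reuse preallocated scratch matrices and fold the transposition into the multiplication instead of creating temporaries:

// Sketch only: illustrates Colt's
// zMult(B, C, alpha, beta, transposeA, transposeB), which computes
// C = alpha * op(this) * op(B) + beta * C and writes the result into C.
import cern.colt.matrix.DoubleFactory2D;
import cern.colt.matrix.DoubleMatrix2D;
import cern.jet.math.Functions;

public class ZMultSketch {
    public static void main(String[] args) {
        int m = 200, n = 100, k = 10;

        DoubleMatrix2D A = DoubleFactory2D.dense.random(m, n); // matrix to factorize, A ~= W * H
        DoubleMatrix2D W = DoubleFactory2D.dense.random(m, k);
        DoubleMatrix2D H = DoubleFactory2D.dense.random(k, n);

        // Scratch matrices allocated once, reused by every iteration.
        DoubleMatrix2D WtA  = DoubleFactory2D.dense.make(k, n);
        DoubleMatrix2D WtW  = DoubleFactory2D.dense.make(k, k);
        DoubleMatrix2D WtWH = DoubleFactory2D.dense.make(k, n);

        for (int i = 0; i < 25; i++) {
            // WtA = W' * A  (transposeA = true, so W' is never materialized)
            W.zMult(A, WtA, 1, 0, true, false);
            // WtW = W' * W
            W.zMult(W, WtW, 1, 0, true, false);
            // WtWH = WtW * H
            WtW.zMult(H, WtWH, 1, 0, false, false);
            // Multiplicative update of H, element-wise and in place:
            // H = H .* WtA ./ WtWH  (real code would guard against zero denominators)
            H.assign(WtA, Functions.mult).assign(WtWH, Functions.div);
            // The update of W is analogous and omitted here.
        }
    }
}

The point is that, apart from W, H and the three scratch matrices, no further allocations happen inside the loop.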
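And a similarly hypothetical snippet for point 4b (again, not code from Carrot2), showing that viewDice() and viewColumn() are only windows onto the same backing data, so nothing is copied and writes through a view show up in the original matrix:

import cern.colt.matrix.DoubleMatrix1D;
import cern.colt.matrix.DoubleMatrix2D;
import cern.colt.matrix.impl.DenseDoubleMatrix2D;

public class ViewSketch {
    public static void main(String[] args) {
        DoubleMatrix2D m = new DenseDoubleMatrix2D(new double[][] {
            { 1, 2, 3 },
            { 4, 5, 6 }
        });

        DoubleMatrix2D transposed = m.viewDice();  // 3x2 transposed view, no copying
        DoubleMatrix1D column = m.viewColumn(1);   // second column as a 1-d view

        column.set(0, 42);                         // a write through the view...
        System.out.println(m.get(0, 1));           // ...is visible in the original: 42.0
        System.out.println(transposed.get(1, 0));  // ...and through the transposed view: 42.0
    }
}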
