Hi Ted,

Apologies I couldn't join the discussion earlier. Here's how we use the matrix APIs in Carrot2:

1. We use only 2D matrices and vectors.

2. We mainly use dense matrices (to implement decompositions); we also use the sparse ones occasionally.

3. We use the SVD, though it's not the default in our algorithms, so we could live without it for some time, I suppose. If the SVD is used, we depend on the order of singular vectors, but we can always sort the decomposition result afterwards based on the singular values.

4. The reason I chose Colt's API over the other ones was that it enabled writing code that minimizes the allocation of temporary matrices and the copying of data. A few examples:

a) The zMult() method can optionally store the result of the multiplication in one of its arguments instead of allocating a clean new matrix each time. This is especially useful in our iterative matrix decomposition routines, such as this one:

https://fisheye3.atlassian.com/browse/carrot2/trunk/core/carrot2-util-matrix/src/org/carrot2/matrix/factorization/NonnegativeMatrixFactorizationED.java?hb=true#to37

Plus, you can tell zMult() to transpose its arguments and multiply by a constant during the multiplication, which is also handy, but not crucial.

b) The viewDice(), viewColumn(), viewRow() and possibly some other views don't involve any copying of matrix data; they simply modify the way the raw data is accessed (see e.g. AbstractMatrix2D.vDice()), potentially saving time and garbage collection.

c) The data representation and the zMult() method are similar to the GEMM functions from BLAS implementations, which made it fairly easy to implement a dense matrix backed by native, platform-specific multiplication routines. Even with Colt's optimized dense matrix multiplication, the native versions were way faster (up to 10x when backed by ATLAS).

(I've pasted two rough code sketches illustrating a) and b) at the very bottom of this message, below the quoted text.)

We also use some less significant bits of the Colt API, like sorting, but these are not crucial and can be implemented "by hand" where needed.

I'm not sure to what extent the performance and memory footprint would change if we switched to Mahout matrices in their current shape, though. Later next week I can reimplement one of our factorizations using Mahout matrices and run some benchmarks. Based on that, we can determine where the bottlenecks are.

S.

On Thu, Feb 3, 2011 at 22:52, Ted Dunning <[email protected]> wrote:
> Actually, I haven't liked the double[][] implementation ever.
>
> Its great virtue is that it exists.
>
>
> On Thu, Feb 3, 2011 at 1:42 PM, Dawid Weiss <[email protected]> wrote:
>
>> >> private double[][] values;
>> > No, but that isn't all that hard to fix.
>>
>> Oh, I just thought you guys wanted it to be this way... But if not,
>> then a double[] is probably a better choice anyway (think potential
>> matrix data fragmentation if not anything else).
>>
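P.S. To make point 4a a bit more concrete, here's a rough, untested sketch (not our actual Carrot2 code; the dimensions, variable names and iteration count are made up) of how zMult()'s extended form lets an iterative update, here a simplified NMF step, reuse preallocated scratch matrices and fold the transposition into the multiplication instead of creating temporaries:

// Sketch only: illustrates Colt's
// zMult(B, C, alpha, beta, transposeA, transposeB), which computes
// C = alpha * op(this) * op(B) + beta * C and writes the result into C.
import cern.colt.matrix.DoubleFactory2D;
import cern.colt.matrix.DoubleMatrix2D;
import cern.jet.math.Functions;

public class ZMultSketch {
    public static void main(String[] args) {
        int m = 200, n = 100, k = 10;

        DoubleMatrix2D A = DoubleFactory2D.dense.random(m, n); // matrix to factorize, A ~= W * H
        DoubleMatrix2D W = DoubleFactory2D.dense.random(m, k);
        DoubleMatrix2D H = DoubleFactory2D.dense.random(k, n);

        // Scratch matrices allocated once, reused by every iteration.
        DoubleMatrix2D WtA  = DoubleFactory2D.dense.make(k, n);
        DoubleMatrix2D WtW  = DoubleFactory2D.dense.make(k, k);
        DoubleMatrix2D WtWH = DoubleFactory2D.dense.make(k, n);

        for (int i = 0; i < 25; i++) {
            // WtA = W' * A  (transposeA = true, so W' is never materialized)
            W.zMult(A, WtA, 1, 0, true, false);
            // WtW = W' * W
            W.zMult(W, WtW, 1, 0, true, false);
            // WtWH = WtW * H
            WtW.zMult(H, WtWH, 1, 0, false, false);
            // Multiplicative update of H, element-wise and in place:
            // H = H .* WtA ./ WtWH  (real code would guard against zero denominators)
            H.assign(WtA, Functions.mult).assign(WtWH, Functions.div);
            // The update of W is analogous and omitted here.
        }
    }
}

The point is that, apart from W, H and the three scratch matrices, no further allocations happen inside the loop.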
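And a similarly hypothetical snippet for point 4b (again, not code from Carrot2), showing that viewDice() and viewColumn() are only windows onto the same backing data, so nothing is copied and writes through a view show up in the original matrix:

import cern.colt.matrix.DoubleMatrix1D;
import cern.colt.matrix.DoubleMatrix2D;
import cern.colt.matrix.impl.DenseDoubleMatrix2D;

public class ViewSketch {
    public static void main(String[] args) {
        DoubleMatrix2D m = new DenseDoubleMatrix2D(new double[][] {
            { 1, 2, 3 },
            { 4, 5, 6 }
        });

        DoubleMatrix2D transposed = m.viewDice();  // 3x2 transposed view, no copying
        DoubleMatrix1D column = m.viewColumn(1);   // second column as a 1-d view

        column.set(0, 42);                         // a write through the view...
        System.out.println(m.get(0, 1));           // ...is visible in the original: 42.0
        System.out.println(transposed.get(1, 0));  // ...and through the transposed view: 42.0
    }
}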
