Hey all,

It's been a while since I've been all that active, but I've been trying to run some of the matrix stuff against really monstrous matrices (many of you are aware that I switched jobs recently, and my new gig has some pretty ridiculous data sets). I'm also trying to get things to run in a maximally scalable way: in particular, running on large Hadoop clusters, but without the luxury of increasing the mapper/reducer heaps, and without the luxury of a ton of memory on the machine I'm driving the Mahout jobs from.
What this means is I've been running into little things like dependencies on O(numColumns) memory, forgetting to override the defaults and make sure you're using a bunch of reducers, or needing the ability to choose between several different implementations (or better: not having to choose). For example, if you matrix multiply two matrices, one of them might be small enough to fit in memory; or alternately, you're multiplying a matrix by a really, really big vector, and even the vector itself doesn't fit in memory.

So I've been playing with a few different ways to approach some of these issues, and I thought I'd bring them up here for any thoughts:

- The distributed SVD code currently stores, in memory (on the "driving" computer), desiredRank vectors (each of size numCols of the input matrix) as a basis, to be used for eigen-generation. Then the eigenvectors (all desiredRank of them) are computed as linear combinations of this basis (essentially finalEigens = lanczosEigens.times(basis), where lanczosEigens are the eigenvectors of the small (desiredRank x desiredRank) tri-diagonal auxiliary matrix generated during power iteration). Net result: the current algorithm actually needs (desiredRank * numCols * 16) bytes available on the driving machine, and since Lanczos currently grabs both the large-eigenvalue eigenvectors and the small ones, while the user typically wants only one or the other, they're typically going to need to ask for a desiredRank 2-3x as high as they truly desire. So for 100K columns, a rank-250 SVD currently takes about 1GB, and at 1M columns, rank 250 is up to 10GB. A bit hefty for a "scalable" algorithm.

- Even once the above is taken care of (by paging vectors off to local disk), the issue of dealing with DenseVector instances with hundreds of millions of entries comes up sometimes. 250M doubles is 4GB for a single vector. Hanging onto even one of these in a Reducer is unacceptably unscalable when the dimensionality gets up that high.
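For the first issue, one way I've been thinking about it: once the basis vectors are paged to local disk, each output eigenvector can be built as a streaming linear combination, touching one basis vector at a time, so peak heap is O(numCols) for an accumulator rather than O(desiredRank * numCols). A rough sketch with plain double[] arrays (the Iterable here stands in for whatever replays basis vectors off disk -- this is illustrative, not the actual Mahout API):

```java
import java.util.ArrayList;
import java.util.List;

public class StreamingRecombination {

  /**
   * Computes one row of finalEigens = lanczosEigens.times(basis) as
   * eigen = sum_i lanczosRow[i] * basis_i, visiting one basis vector
   * at a time. Only the accumulator and the current basis vector need
   * to be in memory; basisSource can replay vectors from local disk.
   */
  static double[] recombine(double[] lanczosRow,
                            Iterable<double[]> basisSource,
                            int numCols) {
    double[] eigen = new double[numCols];
    int i = 0;
    for (double[] basisVector : basisSource) {
      double coeff = lanczosRow[i++];
      for (int c = 0; c < numCols; c++) {
        eigen[c] += coeff * basisVector[c];
      }
    }
    return eigen;
  }

  public static void main(String[] args) {
    // Tiny example: 2 basis vectors of dimension 3.
    List<double[]> basis = new ArrayList<>();
    basis.add(new double[] {1.0, 0.0, 2.0});
    basis.add(new double[] {0.0, 1.0, 1.0});
    // eigen = 2*[1,0,2] + 3*[0,1,1] = [2, 3, 7]
    double[] eigen = recombine(new double[] {2.0, 3.0}, basis, 3);
    System.out.println(eigen[0] + " " + eigen[1] + " " + eigen[2]);
  }
}
```

The tradeoff is one sequential pass over all basis vectors per output eigenvector, but that's pure streaming I/O off local disk, which is cheap compared to blowing the driver heap.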
Backing a DenseVector with a SequenceFile<IntWritable,DoubleWritable> might be the right way to go. It would allow for a variety of distributed computations with vectors of this monstrous size, and would be truly scalable, although many algorithms which can currently be done in one MR pass (or "k" passes) would take two (or 2*k), by needing reduce-side joins interleaved in their main activities.

One question I'm wrestling with is whether we should have disk-backed (local or HDFS) vectors and matrices transparently implement Vector and Matrix, or whether this will lead to confusion and trick people into treating them the same way they treat the in-memory versions (i.e. thinking they can do lots of random-access operations, which would just thrash and not be performant). Any thoughts?

-jake
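To make the access-pattern concern concrete, here's a toy stand-in for the SequenceFile-backed idea using plain java.io streams of (int, double) pairs instead of the Hadoop Writables (hypothetical names, not proposed API). Sequential operations like dot products work in one streaming pass with O(1) heap beyond the other operand; a random-access get(int) on the same representation would mean a seek-and-scan per call, which is exactly the thrashing worry if this transparently implemented Vector:

```java
import java.io.*;

public class DiskBackedVectorSketch {

  /** Writes (index, value) pairs sequentially to local disk,
      mimicking a SequenceFile<IntWritable,DoubleWritable>. */
  static void writeVector(File f, double[] values) throws IOException {
    try (DataOutputStream out = new DataOutputStream(
        new BufferedOutputStream(new FileOutputStream(f)))) {
      for (int i = 0; i < values.length; i++) {
        out.writeInt(i);
        out.writeDouble(values[i]);
      }
    }
  }

  /** Sequential dot product: a single streaming pass over the file,
      never materializing the disk-backed vector in memory. */
  static double dot(File f, double[] other) throws IOException {
    double sum = 0.0;
    try (DataInputStream in = new DataInputStream(
        new BufferedInputStream(new FileInputStream(f)))) {
      while (true) {
        int index;
        try {
          index = in.readInt();
        } catch (EOFException eof) {
          break;  // end of the pair stream
        }
        sum += in.readDouble() * other[index];
      }
    }
    return sum;
  }

  public static void main(String[] args) throws IOException {
    File f = File.createTempFile("vec", ".bin");
    f.deleteOnExit();
    writeVector(f, new double[] {1.0, 2.0, 3.0});
    // 1*4 + 2*5 + 3*6 = 32
    System.out.println(dot(f, new double[] {4.0, 5.0, 6.0}));
  }
}
```

One possible middle ground: have the disk-backed classes implement only the iterator/assign-style parts of the interface and throw (or log loudly) on random access, so the type system nudges people toward the access patterns that actually scale.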
