Hey all,

  It's been a while since I've been all that active, but I've been trying to
run some of the matrix stuff against really monstrous matrices (many of you
are aware that I switched jobs recently, and my new gig has some pretty
ridiculous data sets).  I'm also trying to get things to run in a maximally
scalable way: in particular, on large Hadoop clusters, but without the
luxury of increasing the mapper/reducer heaps, and without the luxury of a
ton of memory on the machine I'm driving the Mahout jobs from.

  What this means is I've been running into little things like dependencies
on O(numColumns) memory, or forgetting to override and make sure you're
using a bunch of reducers, or needing the ability to choose among several
different implementations (or better: not having to choose).  For example,
if you multiply two matrices, one of them might be small enough to fit in
memory, or alternately, you're multiplying a matrix by a really, really big
vector, and even the vector itself doesn't fit in memory.

  So I've been playing with a few different ways to approach some of these
issues, and I thought I'd bring them up here for any thoughts:


   -   The distributed SVD code currently stores, in memory (on the
   "driving" computer), desiredRank Vectors (each of size numCols of the
   input matrix) as a basis to be used for eigen-generation.  Then the
   eigenvectors (all desiredRank of them) are computed as linear
   combinations of this basis (essentially finalEigens =
   lanczosEigens.times(basis), where lanczosEigens are the eigenvectors of
   the small (desiredRank x desiredRank) tri-diagonal auxiliary matrix
   generated during power iteration).  Net result: the current algorithm
   actually needs (desiredRank * numCols * 16) bytes available on the
   driving machine, and since Lanczos currently grabs eigenvectors for both
   the largest eigenvalues and the smallest ones, while the user typically
   wants only one or the other, they're typically going to need to ask for
   2-3x as high a desiredRank as they truly desire.  So for 100K columns, a
   rank-250 SVD currently takes about 1GB, and at 1M columns, rank 250 is
   up to 10GB.  A bit hefty for a "scalable" algorithm.
   - Even once the above is taken care of (by paging vectors off to local
   disk), the issue of dealing with DenseVector instances with hundreds of
   millions of entries still comes up.  250M doubles is 4GB for a single
   vector.  Hanging onto even one of these in a Reducer is unacceptably
   unscalable once the dimensionality gets that high.  Backing a
   DenseVector with a SequenceFile<IntWritable,DoubleWritable> might be the
   right way to go: it would allow for a variety of distributed
   computations with vectors of this monstrous size and would be truly
   scalable, although many algorithms which can currently be done in one MR
   pass (or k passes) would take two (or 2*k), by needing reduce-side joins
   interleaved in their main activities.
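  For concreteness, the basis footprint in the first point is just
back-of-the-envelope arithmetic.  Here's a sketch (not Mahout code; the
~16 bytes per stored double is the same accounting as in the figures
above, and the class/method names are made up for illustration):

```java
// Sketch: rough memory estimate for the in-memory Lanczos basis.
// Assumes ~16 bytes per stored double (array slot plus object and
// serialization overhead), matching the accounting above.  Not Mahout
// code; names are hypothetical.
public class BasisFootprint {
    static long basisBytes(int desiredRank, long numCols) {
        return desiredRank * numCols * 16L;
    }

    public static void main(String[] args) {
        // 100K columns, rank 250: 400MB by the raw formula; asking for
        // 2-3x the rank (to discard the unwanted end of the spectrum)
        // pushes this toward the ~1GB quoted above.
        System.out.println(basisBytes(250, 100_000));
        // 1M columns, rank 250: 4GB raw, ~10GB with the 2-3x factor.
        System.out.println(basisBytes(250, 1_000_000));
    }
}
```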
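  As a sketch of the kind of single-pass computation a SequenceFile-backed
vector would still support, here's a dot product over two index-sorted
streams of (index, value) entries: the same merge a reduce-side join
performs.  Plain Java iterators stand in for the SequenceFile readers;
nothing here is actual Mahout or Hadoop API:

```java
import java.util.Iterator;
import java.util.Map;

// Sketch: dot product of two vectors exposed only as index-sorted
// streams of (index, value) entries -- the access pattern a
// SequenceFile<IntWritable,DoubleWritable>-backed vector would give.
// Iterators stand in for SequenceFile readers; names are hypothetical.
public class StreamingDot {
    static double dot(Iterator<Map.Entry<Integer, Double>> a,
                      Iterator<Map.Entry<Integer, Double>> b) {
        double sum = 0.0;
        Map.Entry<Integer, Double> ea = a.hasNext() ? a.next() : null;
        Map.Entry<Integer, Double> eb = b.hasNext() ? b.next() : null;
        while (ea != null && eb != null) {
            int cmp = ea.getKey().compareTo(eb.getKey());
            if (cmp == 0) {          // indices match: accumulate product
                sum += ea.getValue() * eb.getValue();
                ea = a.hasNext() ? a.next() : null;
                eb = b.hasNext() ? b.next() : null;
            } else if (cmp < 0) {    // advance whichever stream is behind
                ea = a.hasNext() ? a.next() : null;
            } else {
                eb = b.hasNext() ? b.next() : null;
            }
        }
        return sum;
    }
}
```

This needs O(1) memory regardless of dimensionality; the price, as noted
above, is that getting two such streams aligned on index in the first
place costs an extra join pass in MapReduce.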

  One question I'm wrestling with is whether we should have disk-backed
(local or HDFS) vectors and matrices transparently implement Vector and
Matrix, or whether this will lead to confusion and trick people into
treating them the same way they treat the in-memory versions (i.e.,
thinking they can do lots of random-access operations, which will just
thrash the disk and perform terribly).
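  One possible middle ground: rather than transparently implementing
Vector, a disk-backed vector could expose only the sequential operations
that are actually cheap, so the access pattern is part of the type.  A
minimal sketch, with a trivial in-memory stand-in where a real
implementation would wrap a SequenceFile reader (all names hypothetical,
not proposed Mahout API):

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Sketch: a deliberately narrow contract for disk-backed vectors.  It
// offers size and sequential iteration, but no random-access get(int),
// so the type itself prevents the thrashing access patterns worried
// about above.  All names are hypothetical.
public class SequentialVectors {
    interface SequentialVector {
        long size();
        Iterator<double[]> iterate(); // each element is {index, value}
    }

    // Trivial in-memory stand-in; a real one would stream a SequenceFile.
    static SequentialVector of(final double... values) {
        return new SequentialVector() {
            public long size() { return values.length; }
            public Iterator<double[]> iterate() {
                return new Iterator<double[]>() {
                    private int i = 0;
                    public boolean hasNext() { return i < values.length; }
                    public double[] next() {
                        if (!hasNext()) throw new NoSuchElementException();
                        double[] e = { i, values[i] };
                        i++;
                        return e;
                    }
                };
            }
        };
    }

    // An operation that is naturally single-pass: the 2-norm.
    static double norm2(SequentialVector v) {
        double sumSq = 0.0;
        for (Iterator<double[]> it = v.iterate(); it.hasNext(); ) {
            double val = it.next()[1];
            sumSq += val * val;
        }
        return Math.sqrt(sumSq);
    }
}
```

The downside, of course, is that existing code written against Vector
couldn't use these at all without an adapter.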

  Any thoughts?

  -jake
