(Oops. I meant to send this to the list.)
The rule-of-thumb computational time dynamics I am seeing on our branch of ssvd are as follows: 1G worth of gz-compressed (i.e. highly compressed) sequence file output would compute in 5-7 minutes on 8-12 4-core nodes (assuming 500 singular values including oversampling, k+p). 64G worth of compressed output is thought to take <= 10 minutes still, provided there's enough capacity in the cluster to pick up ~1000 map tasks without delay (i.e. perhaps 250 4-core nodes are required).

It is possible Mahout's branch will require more memory and will run slower in low-memory situations, because it allocates and relinquishes big chunks of young-generation memory to load every row of the input. If 30m is the dense width of the matrix, that's about 500 MB of memory per row to allocate and throw away, which will probably spawn a lot of YGC activity and, depending on the occupied-RAM ratio, likely full GCs as well, so it may be quite slow.

Also, 30m-wide vectors will certainly require -minSplitSize to be bumped A LOT: for k+p = 500 it would be 500 MB * 500 * compression ratio ~= 250G, i.e. A LOT of I/O to run the mappers. (Quick back-of-envelope sketch of this arithmetic at the bottom of this message, below the quote.) There's a version in the works that circumvents this a bit, but it's not in Mahout (yet).

(So for a 30m x 30m 100% dense matrix in double arithmetic, the total input must be a really huge file, btw, about up to 15 PB or so?)

On Wed, Apr 6, 2011 at 9:43 AM, Ted Dunning <[email protected]> wrote:
> If you did mean 60 million by 60 million, is that matrix sparse?
>
> Also, how many eigenvectors did you ask for?
> How large is your machine in terms of memory?
> You might also experiment with the random projection version of SVD.
> Dmitriy can comment on how to run that.
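P.S. A quick back-of-envelope sketch of the arithmetic above, for anyone who wants to rerun it. This is a throwaway illustration (plain Java, not Mahout code); the 30m dense width, the ~500 MB per-row allocation and k+p = 500 are the figures from this thread, everything else is just an assumption.

public class SsvdBackOfEnvelope {
  public static void main(String[] args) {
    long width  = 30000000L;  // assumed dense width of the matrix (30m columns)
    long kPlusP = 500L;       // singular values including oversampling (k+p)
    long rowMb  = 500L;       // per-row allocation figure quoted above, in MB

    // Raw double payload of one dense row (object overhead and copies excluded).
    long rawRowMb = width * 8L / 1000000L;              // ~240 MB

    // Uncompressed data a single mapper has to chew through to see one full
    // k+p block of rows -- the reason -minSplitSize needs to be bumped a lot.
    long splitGb = rowMb * kPlusP / 1000L;               // ~250 GB (times compression ratio on disk)

    // Whole 30m x 30m dense matrix, counting raw double bytes only.
    long fullTb = width * width * 8L / 1000000000000L;   // ~7200 TB, i.e. ~7 PB

    System.out.println("raw doubles per row           : ~" + rawRowMb + " MB");
    System.out.println("uncompressed k+p worth of rows: ~" + splitGb + " GB");
    System.out.println("full dense matrix             : ~" + fullTb + " TB");
  }
}

(The last figure counts raw double bytes only; counting ~500 MB per row over 30m rows it comes out closer to 15 PB.)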
