(Oops. I meant to send this to the list.)


The rule-of-thumb computational time dynamics I am seeing on our
branch of ssvd are as follows:

1G worth of gz-compressed (i.e. highly compressed) sequence file
output would compute in 5-7 minutes on 8-12 4-core nodes (assuming 500
singular values including oversampling, i.e. k+p = 500).

64G worth of compressed output should still take <= 10 minutes,
provided there's enough capacity in the cluster to pick up ~1000 map
tasks without delay (i.e. perhaps 250 4-core nodes are required).

It is possible Mahout's branch will require more memory and will run
slower in low-memory situations because it allocates and relinquishes
big chunks of young-generation memory to load every row of the input
(if 30m is thought to be the dense width of the matrix, that is about
500 MB of memory per row to allocate and throw away; that will
probably spawn a lot of YGC activity and, depending on the occupied
RAM ratio, likely full GCs as well, so it may be quite slow).
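
For a rough sense of the per-row allocation involved, here is a sketch
that counts only the raw 8-byte doubles; the actual Vector objects and
their serialized form add overhead on top of that, so treat it as an
estimate rather than a measurement:

public class RowMemoryEstimate {
  public static void main(String[] args) {
    // widths mentioned in this thread: 30m here, 60m in Ted's question below
    for (long width : new long[] {30_000_000L, 60_000_000L}) {
      long rawBytes = width * 8L;  // raw 8-byte double values only
      System.out.printf("width %,d -> ~%d MB of doubles per row%n",
          width, rawBytes >> 20);
    }
    // allocating and dropping a buffer of this size for every input row is
    // what drives the young-GC (and possibly full-GC) churn described above
  }
}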

Also, 30m-wide vectors will certainly require -minSplitSize to be
bumped A LOT. For k+p=500 it would be 500 MB * 500 * compression ratio
~= 250G per split, which is A LOT OF I/O to run the mappers. There's a
version in the works that circumvents this a bit, but it's not in
Mahout (yet).
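
The multiplication above comes from the requirement (as I read it) that
a single split must carry at least k+p rows; here is a small sketch of
the same estimate, where the ~500 MB/row figure and the compression
ratio of 1.0 are placeholders to be adjusted:

public class SplitSizeEstimate {
  public static void main(String[] args) {
    int kPlusP = 500;               // k + p, rank plus oversampling
    long bytesPerRow = 500L << 20;  // ~500 MB per dense row, per the estimate above
    double compressionRatio = 1.0;  // plug in the observed gz ratio here
    long bytesPerSplit = (long) (kPlusP * bytesPerRow * compressionRatio);
    System.out.printf("minimum split size: ~%d GB%n", bytesPerSplit >> 30);
  }
}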

(So, btw, for a 30m x 30m 100% dense matrix in double arithmetic the
total input must be a really huge file: 30m * 30m * 8 bytes is on the
order of 7 PB uncompressed?)



On Wed, Apr 6, 2011 at 9:43 AM, Ted Dunning <[email protected]> wrote:
> If you did mean 60 million by 60 million, is that matrix sparse?
>
> Also, how many eigenvectors did you ask for?
> How large is your machine in terms of memory?
> You might also experiment with the random projection version of SVD.
>  Dmitriy can comment on
> how to run that.
