PS: the numbers are for a CDH3b3 or b4 bare-bones cluster with some
optimization settings (JVM reuse across 10 tasks, for example).

Amazon EC2 would of course be slower, for several reasons: 0.20.2 is
slow to allocate tasks; Amazon, I think, intentionally places nodes on
different subnets; and unless you are using a double-extra-large
instance, you share all I/O with co-tenants' VMs, which usually kills
disk performance. I verified with Amazon's architects that (at least in
the West datacenter) the double-extra-large memory instance (the one
with 32 GB of RAM) is the only one that would, most of the time,
guarantee no co-tenant I/O. (Ironically, you probably don't need those
instances for what they are touted for; you might get away with much
smaller instances if they weren't co-tenanted, but you can't.)


On Wed, Apr 6, 2011 at 11:02 AM, Dmitriy Lyubimov <[email protected]> wrote:
> (Oops. I meant to send this to the list.)
>
>
>
> The rule-of-thumb computation time dynamics I am seeing on our
> branch of ssvd are as follows:
>
> 1G of gz-compressed (i.e. highly compressed) sequence file worth of
> output would compute in 5-7 minutes on 8-12 4-core nodes (assuming 500
> singular values including oversampling, i.e. k+p).
>
> 64G worth of compressed output is thought to still take <= 10 minutes,
> provided there's enough capacity in the cluster to pick up ~1000 map
> tasks without delay (i.e. perhaps 250 4-core nodes are required).
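
A quick sanity check on those numbers: ~1000 map tasks is roughly what
you get from 64 GB of input at the default 64 MB HDFS block size (an
assumption here, not something pinned down above), and 250 4-core nodes
is about what it takes to run them all in a single map wave. A
back-of-envelope sketch, nothing more:

  public class MapWaveEstimate {
    public static void main(String[] args) {
      long inputBytes = 64L * 1024 * 1024 * 1024;     // 64 GB of compressed input
      long blockSize = 64L * 1024 * 1024;             // assumed default HDFS block size
      long mapTasks = inputBytes / blockSize;         // ~1000 map tasks
      int coresPerNode = 4;
      long nodesForOneWave = mapTasks / coresPerNode; // ~250 nodes for one map wave
      System.out.println(mapTasks + " map tasks, ~" + nodesForOneWave
          + " 4-core nodes to run them in one wave");
    }
  }
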
>
> It is possible Mahout's branch will require more memory and will run
> slower in low-memory situations, because it allocates and relinquishes
> big chunks of young-generation memory to load every row of the input
> (if 30m is the dense width of the matrix, that is about 500 MB of
> memory per row to allocate and throw away; that will probably spawn a
> lot of YGC activity and, depending on the occupied-RAM ratio, likely
> full GCs as well, so it may be quite slow).
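
To make the per-row figure concrete: a 30m-wide dense row of doubles is
30,000,000 * 8 bytes ~= 240 MB of raw payload; the ~500 MB above
presumably also counts deserialization/vector-object copies (that, and
the exact overhead factor below, are guesses, not measurements). A
minimal sketch of the arithmetic:

  public class RowAllocationEstimate {
    public static void main(String[] args) {
      long width = 30000000L;                  // dense row width (30m)
      long rawBytes = width * 8L;              // raw double payload, ~240 MB
      double overheadFactor = 2.0;             // guessed copy/object overhead
      double perRowBytes = rawBytes * overheadFactor;  // roughly the ~500 MB figure
      System.out.printf("raw: %.0f MB, with assumed overhead: ~%.0f MB per row%n",
          rawBytes / 1e6, perRowBytes / 1e6);
    }
  }
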
>
> Also, 30m-wide vectors will certainly require -minSplitSize to be
> bumped A LOT. For k+p=500 it would be 500 MB * 500 * compression ratio
> ~= 250 GB, which is A LOT of I/O to run the mappers. There's a version
> in the works to circumvent this somewhat, but it's not in Mahout (yet).
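
One reading of that split-size arithmetic (an interpretation, not a
spec): a split has to cover on the order of k+p rows, so with ~500 MB
per row and k+p = 500 each mapper ends up streaming roughly 250 GB of
uncompressed data, scaled by whatever the gz ratio is on disk:

  public class SplitSizeEstimate {
    public static void main(String[] args) {
      int kPlusP = 500;          // decomposition rank plus oversampling
      double rowMB = 500.0;      // ~500 MB per dense 30m-wide row (figure from above)
      double uncompressedGB = kPlusP * rowMB / 1000.0;  // ~250 GB per mapper
      double gzRatio = 0.5;      // placeholder on-disk compression ratio
      System.out.printf("~%.0f GB uncompressed per split, ~%.0f GB on disk at ratio %.1f%n",
          uncompressedGB, uncompressedGB * gzRatio, gzRatio);
    }
  }
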
>
> (So, btw, for a 30m x 30m 100% dense matrix in double arithmetic the
> total input must be a really huge file, up to 15 TB or so?)
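
For the total-size question, straight multiplication for a fully dense
30m x 30m matrix of doubles gives 3e7 * 3e7 * 8 bytes ~= 7.2 PB before
any compression, so the on-disk size really depends on how well the
data compresses; this is just the raw arithmetic, not a claim about any
particular data set:

  public class DenseMatrixSize {
    public static void main(String[] args) {
      double n = 30e6;              // 30m rows and 30m columns
      double bytes = n * n * 8.0;   // 8 bytes per double element
      System.out.printf("~%.1f PB uncompressed for a fully dense matrix%n", bytes / 1e15);
    }
  }
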
>
>
>
> On Wed, Apr 6, 2011 at 9:43 AM, Ted Dunning <[email protected]> wrote:
>> If you did mean 60 million by 60 million, is that matrix sparse?
>>
>> Also, how many eigenvectors did you ask for?
>> How large is your machine in terms of memory?
>> You might also experiment with the random projection version of SVD.
>> Dmitriy can comment on how to run that.
>
