Re: CNB: Learning from Huge Datasets

Ted Dunning Fri, 11 Jul 2008 17:13:48 -0700

If the matrix is not sparse, then you may have a problem implementing it
efficiently at scale.  Non-sparse algorithms have a distressing tendency to
scale quadratically with class x features.  Remember that sparsity can be
relative to an offset so that if a particular count is common for a class,
feature compound, you can carry along that count as a default.

Another possible approach to sparsity is to represent your matrix as a sum
of rank-1 components plus a highly sparse residual term.  I can't say
without thinking whether you can do this.  Conceptually speaking, however,
your data almost has to have some redundant representation because your
original data is too small to encode a high complexity matrix.

My next thought is that if you absolutely must operate on the dense matrix,
then you need to phrase the problem in matrix terms and look for higher
order operations.  This can help scaling problems, but not cure them.

The basic idea is that naive implementation of matrix-multiply requires n^3
arithmetic operations on roughly n^2 elements.  If you do this poorly, then
you have n^3 data communication operations which may kill your algorithm.
If you order the operations well, then you can have roughly n^2
communication operations which scales better than the n^3 arithmetic
operations and given the speed of modern processors relative to network
speeds, will allow network bounded scaling up to n \approx 10^6 or so.
Beyond that, you will be arithmetic bound.

Note that matrix-multiply here is any operation that has the same "shape" as
traditional matrix multiplications.

On Fri, Jul 11, 2008 at 4:59 PM, Robin Anil <[EMAIL PROTECTED]> wrote:

>
> So whats troubling is that we need to create a matrix of that scale which
> can be accessible and updatable easily during map/reduce. Currently the
> matrix is generated as cell_id = > value format in each stage of map
> reduce.
> Esentially each Job is iterating over the whole matix.
>
>
>

Re: CNB: Learning from Huge Datasets

Reply via email to