Re: CNB: Learning from Huge Datasets

Robin Anil Fri, 11 Jul 2008 21:11:22 -0700

On Sat, Jul 12, 2008 at 5:43 AM, Ted Dunning <[EMAIL PROTECTED]> wrote:


> If the matrix is not sparse, then you may have a problem implementing it
> efficiently at scale.  Non-sparse algorithms have a distressing tendency to
> scale quadratically with class x features.  Remember that sparsity can be
> relative to an offset so that if a particular count is common for a class,
> feature compound, you can carry along that count as a default.


I am putting the formulation here. If anyone gets an idea please post

If there are K labels k=1,2...K and J features j=1,2,.....J
if w_jk is the weight of feature_j in label k
Sigma_j is the sum of w_jk for all K(sum of weight of a feature j in all
labels)
Sigma_k is the sum of w_jk for all J(sum of weight of all the features in a
label k)
Sigma_j_Sigma_k is the sum of weights of all features in all labels.

CNB formulation is   weight = (Sigma_j - w_jk + 1 )/ ( Sigma_k_Sigma_j -
Sigma_k + J)

I have the following data with me.
Sigma_j for all j's
Sigma_k for all k's
Sigma_k_Sigma_j
w_jk for all label=k,feature=j pair
*
My Current method of calculating the weight.*
Since Sigma_k, Sigma_kSigma_j and J are very small amount of data. They are
passed in the job conf to all the Mappers and Reducers

Input to the Mapper: Sigma_j's and w_jk's
Map:
          If the input is from Sigma_j (i have a way of differentiating
them) then output K values for each k output ["j,k" = > (Sigma_j +
1.0)/(Sigma_k_Sigma_j - Sigma_k + J)]
          If the input is from w_jk then output ["j,k" = > (-1
/(Sigma_k_Sigma_j - Sigma_k + J) ]

Reduce:
          Add up all the values so we get ["j,k" = > (Sigma_j + 1.0 - w_jk)
/(Sigma_k_Sigma_j - Sigma_k + J)]





>
>
> Another possible approach to sparsity is to represent your matrix as a sum
> of rank-1 components plus a highly sparse residual term.  I can't say
> without thinking whether you can do this.  Conceptually speaking, however,
> your data almost has to have some redundant representation because your
> original data is too small to encode a high complexity matrix.
>
> My next thought is that if you absolutely must operate on the dense matrix,
> then you need to phrase the problem in matrix terms and look for higher
> order operations.  This can help scaling problems, but not cure them.
>
> The basic idea is that naive implementation of matrix-multiply requires n^3
> arithmetic operations on roughly n^2 elements.  If you do this poorly, then
> you have n^3 data communication operations which may kill your algorithm.
> If you order the operations well, then you can have roughly n^2
> communication operations which scales better than the n^3 arithmetic
> operations and given the speed of modern processors relative to network
> speeds, will allow network bounded scaling up to n \approx 10^6 or so.
> Beyond that, you will be arithmetic bound.
>
> Note that matrix-multiply here is any operation that has the same "shape"
> as
> traditional matrix multiplications.
>
> Cant say i have the plan from what you said. Will have to read up on good
ordering techniques to see if it can be scaled up.

>
>
>
> On Fri, Jul 11, 2008 at 4:59 PM, Robin Anil <[EMAIL PROTECTED]> wrote:
>
> >
> > So whats troubling is that we need to create a matrix of that scale which
> > can be accessible and updatable easily during map/reduce. Currently the
> > matrix is generated as cell_id = > value format in each stage of map
> > reduce.
> > Esentially each Job is iterating over the whole matix.
> >
> >
> >
>



-- 
Robin Anil
Senior Dual Degree Student
Department of Computer Science & Engineering
IIT Kharagpur

--------------------------------------------------------------------------------------------
techdigger.wordpress.com
A discursive take on the world around us

www.minekey.com
You Might Like This

www.ithink.com
Express Yourself

Re: CNB: Learning from Huge Datasets

Reply via email to