Re: CNB: Learning from Huge Datasets

Robin Anil Fri, 11 Jul 2008 23:18:47 -0700

Just an update. I don't know whether I read the paper wrong. I hope the the
author(Jason Rennie) sees this message.


instead of calculating for all the words in all labels, In the complementary
class. I kept only the weights of the words that occur in the
Non-Complementary class. Rest words I ignored(or weight is zero)

The self classification accuracy on the 20Newsgroups jumped from 98.2 to
99.87. And it solved the dense matrix problem also

Robin


On Sat, Jul 12, 2008 at 9:40 AM, Robin Anil <[EMAIL PROTECTED]> wrote:

>
>
> On Sat, Jul 12, 2008 at 5:43 AM, Ted Dunning <[EMAIL PROTECTED]>
> wrote:
>
>> If the matrix is not sparse, then you may have a problem implementing it
>> efficiently at scale.  Non-sparse algorithms have a distressing tendency
>> to
>> scale quadratically with class x features.  Remember that sparsity can be
>> relative to an offset so that if a particular count is common for a class,
>> feature compound, you can carry along that count as a default.
>
>
> I am putting the formulation here. If anyone gets an idea please post
>
> If there are K labels k=1,2...K and J features j=1,2,.....J
> if w_jk is the weight of feature_j in label k
> Sigma_j is the sum of w_jk for all K(sum of weight of a feature j in all
> labels)
> Sigma_k is the sum of w_jk for all J(sum of weight of all the features in a
> label k)
> Sigma_j_Sigma_k is the sum of weights of all features in all labels.
>
> CNB formulation is   weight = (Sigma_j - w_jk + 1 )/ ( Sigma_k_Sigma_j -
> Sigma_k + J)
>
> I have the following data with me.
> Sigma_j for all j's
> Sigma_k for all k's
> Sigma_k_Sigma_j
> w_jk for all label=k,feature=j pair
> *
> My Current method of calculating the weight.*
> Since Sigma_k, Sigma_kSigma_j and J are very small amount of data. They are
> passed in the job conf to all the Mappers and Reducers
>
> Input to the Mapper: Sigma_j's and w_jk's
> Map:
>           If the input is from Sigma_j (i have a way of differentiating
> them) then output K values for each k output ["j,k" = > (Sigma_j +
> 1.0)/(Sigma_k_Sigma_j - Sigma_k + J)]
>           If the input is from w_jk then output ["j,k" = > (-1
> /(Sigma_k_Sigma_j - Sigma_k + J) ]
>
> Reduce:
>           Add up all the values so we get ["j,k" = > (Sigma_j + 1.0 - w_jk)
> /(Sigma_k_Sigma_j - Sigma_k + J)]
>
>
>
>
>
>>
>>
>> Another possible approach to sparsity is to represent your matrix as a sum
>> of rank-1 components plus a highly sparse residual term.  I can't say
>> without thinking whether you can do this.  Conceptually speaking, however,
>> your data almost has to have some redundant representation because your
>> original data is too small to encode a high complexity matrix.
>>
>> My next thought is that if you absolutely must operate on the dense
>> matrix,
>> then you need to phrase the problem in matrix terms and look for higher
>> order operations.  This can help scaling problems, but not cure them.
>>
>> The basic idea is that naive implementation of matrix-multiply requires
>> n^3
>> arithmetic operations on roughly n^2 elements.  If you do this poorly,
>> then
>> you have n^3 data communication operations which may kill your algorithm.
>> If you order the operations well, then you can have roughly n^2
>> communication operations which scales better than the n^3 arithmetic
>> operations and given the speed of modern processors relative to network
>> speeds, will allow network bounded scaling up to n \approx 10^6 or so.
>> Beyond that, you will be arithmetic bound.
>>
>> Note that matrix-multiply here is any operation that has the same "shape"
>> as
>> traditional matrix multiplications.
>>
>> Cant say i have the plan from what you said. Will have to read up on good
> ordering techniques to see if it can be scaled up.
>
>>
>>
>>
>> On Fri, Jul 11, 2008 at 4:59 PM, Robin Anil <[EMAIL PROTECTED]> wrote:
>>
>> >
>> > So whats troubling is that we need to create a matrix of that scale
>> which
>> > can be accessible and updatable easily during map/reduce. Currently the
>> > matrix is generated as cell_id = > value format in each stage of map
>> > reduce.
>> > Esentially each Job is iterating over the whole matix.
>> >
>> >
>> >
>>
>
>
>
> --
> Robin Anil
> Senior Dual Degree Student
> Department of Computer Science & Engineering
> IIT Kharagpur
>
>
> --------------------------------------------------------------------------------------------
> techdigger.wordpress.com
> A discursive take on the world around us
>
> www.minekey.com
> You Might Like This
>
> www.ithink.com
> Express Yourself
>



-- 
Robin Anil
Senior Dual Degree Student
Department of Computer Science & Engineering
IIT Kharagpur

--------------------------------------------------------------------------------------------
techdigger.wordpress.com
A discursive take on the world around us

www.minekey.com
You Might Like This

www.ithink.com
Express Yourself

Re: CNB: Learning from Huge Datasets

Reply via email to