Just an update. I don't know whether I read the paper wrong. I hope the the author(Jason Rennie) sees this message.
instead of calculating for all the words in all labels, In the complementary class. I kept only the weights of the words that occur in the Non-Complementary class. Rest words I ignored(or weight is zero) The self classification accuracy on the 20Newsgroups jumped from 98.2 to 99.87. And it solved the dense matrix problem also Robin On Sat, Jul 12, 2008 at 9:40 AM, Robin Anil <[EMAIL PROTECTED]> wrote: > > > On Sat, Jul 12, 2008 at 5:43 AM, Ted Dunning <[EMAIL PROTECTED]> > wrote: > >> If the matrix is not sparse, then you may have a problem implementing it >> efficiently at scale. Non-sparse algorithms have a distressing tendency >> to >> scale quadratically with class x features. Remember that sparsity can be >> relative to an offset so that if a particular count is common for a class, >> feature compound, you can carry along that count as a default. > > > I am putting the formulation here. If anyone gets an idea please post > > If there are K labels k=1,2...K and J features j=1,2,.....J > if w_jk is the weight of feature_j in label k > Sigma_j is the sum of w_jk for all K(sum of weight of a feature j in all > labels) > Sigma_k is the sum of w_jk for all J(sum of weight of all the features in a > label k) > Sigma_j_Sigma_k is the sum of weights of all features in all labels. > > CNB formulation is weight = (Sigma_j - w_jk + 1 )/ ( Sigma_k_Sigma_j - > Sigma_k + J) > > I have the following data with me. > Sigma_j for all j's > Sigma_k for all k's > Sigma_k_Sigma_j > w_jk for all label=k,feature=j pair > * > My Current method of calculating the weight.* > Since Sigma_k, Sigma_kSigma_j and J are very small amount of data. They are > passed in the job conf to all the Mappers and Reducers > > Input to the Mapper: Sigma_j's and w_jk's > Map: > If the input is from Sigma_j (i have a way of differentiating > them) then output K values for each k output ["j,k" = > (Sigma_j + > 1.0)/(Sigma_k_Sigma_j - Sigma_k + J)] > If the input is from w_jk then output ["j,k" = > (-1 > /(Sigma_k_Sigma_j - Sigma_k + J) ] > > Reduce: > Add up all the values so we get ["j,k" = > (Sigma_j + 1.0 - w_jk) > /(Sigma_k_Sigma_j - Sigma_k + J)] > > > > > >> >> >> Another possible approach to sparsity is to represent your matrix as a sum >> of rank-1 components plus a highly sparse residual term. I can't say >> without thinking whether you can do this. Conceptually speaking, however, >> your data almost has to have some redundant representation because your >> original data is too small to encode a high complexity matrix. >> >> My next thought is that if you absolutely must operate on the dense >> matrix, >> then you need to phrase the problem in matrix terms and look for higher >> order operations. This can help scaling problems, but not cure them. >> >> The basic idea is that naive implementation of matrix-multiply requires >> n^3 >> arithmetic operations on roughly n^2 elements. If you do this poorly, >> then >> you have n^3 data communication operations which may kill your algorithm. >> If you order the operations well, then you can have roughly n^2 >> communication operations which scales better than the n^3 arithmetic >> operations and given the speed of modern processors relative to network >> speeds, will allow network bounded scaling up to n \approx 10^6 or so. >> Beyond that, you will be arithmetic bound. >> >> Note that matrix-multiply here is any operation that has the same "shape" >> as >> traditional matrix multiplications. >> >> Cant say i have the plan from what you said. Will have to read up on good > ordering techniques to see if it can be scaled up. > >> >> >> >> On Fri, Jul 11, 2008 at 4:59 PM, Robin Anil <[EMAIL PROTECTED]> wrote: >> >> > >> > So whats troubling is that we need to create a matrix of that scale >> which >> > can be accessible and updatable easily during map/reduce. Currently the >> > matrix is generated as cell_id = > value format in each stage of map >> > reduce. >> > Esentially each Job is iterating over the whole matix. >> > >> > >> > >> > > > > -- > Robin Anil > Senior Dual Degree Student > Department of Computer Science & Engineering > IIT Kharagpur > > > -------------------------------------------------------------------------------------------- > techdigger.wordpress.com > A discursive take on the world around us > > www.minekey.com > You Might Like This > > www.ithink.com > Express Yourself > -- Robin Anil Senior Dual Degree Student Department of Computer Science & Engineering IIT Kharagpur -------------------------------------------------------------------------------------------- techdigger.wordpress.com A discursive take on the world around us www.minekey.com You Might Like This www.ithink.com Express Yourself
