I think the discussion is moving away from the actual problem. Let me start again. The CNB stores for every label and for every feature some weight (40B values in the case of this data of wikipedia). The intuition behind that is that: If different labels have different number of documents then it causes a bias effect for those labels with large number of data.
CNB overcomes this bias by creating a complement class( C_bar which has features of all the other classes ~200M except those unique to his which is small compared to 200M) In essence the total data that needs to be stored in 40B the entire matrix. Normally in the case of NB. we just need to store the features that occcur in class C (which creates a very sparse matrix). So whats troubling is that we need to create a matrix of that scale which can be accessible and updatable easily during map/reduce. Currently the matrix is generated as cell_id = > value format in each stage of map reduce. Esentially each Job is iterating over the whole matix. So even with combiners there is not difference because the input of this job is unique features. In the Map job for each feature, I emit 200 key-values (for each of the labels) which is again unique. On Sat, Jul 12, 2008 at 5:08 AM, Ted Dunning <[EMAIL PROTECTED]> wrote: > Regardless of the arithmetic details, it still sounds like combiners will > have substantial impact. > > You may have 40 B *potential* uniques, but I doubt you have that many > actual > uniques. The combiner will knock down the data size by average count. If > you want to get medieval, you can even have the combiner not emit any > results that have less than 2 counts. This is a very approximate pruning > and should be avoided because the result depends on degree of parallelism, > but it may help your cause substantially if nothing else will. > > On Fri, Jul 11, 2008 at 4:24 PM, Robin Anil <[EMAIL PROTECTED]> wrote: > > > On Fri, Jul 11, 2008 at 1:58 PM, Ted Dunning <[EMAIL PROTECTED]> > > wrote: > > > > > It sounds to me like there is scope here for combiners, especially on > the > > > final stage. If they can be applied to earlier stages as well, you > might > > > be > > > able to collapse some of the data nicely. If the number of unique > words > > in > > > the corpus is a million, then a combiner might be able to improve the > > > number > > > of items in the intermediate map output of your last stage by up to two > > > orders of magnitude. > > > > > > Here my key is a label,feature pair so the number of uniques = 40B > > > > > > > > > > > > > Also, by my calculation, 20 x 200M = 4 x 10^9 (not 40 x 10^9). Still > > > large, > > > but not vast. > > > > > > :) That was ~200(Countries) x 200M. The result was alright > > > > > > > > > > On Thu, Jul 10, 2008 at 11:15 PM, Robin Anil <[EMAIL PROTECTED]> > > wrote: > > > > > > > Hi, > > > > I had been experimenting with Wikipedia datadump(17GB) with the CNB > > > > classifier. I used a list of countries of the world(around 229 of > them) > > > as > > > > the labels and then created a classification dataset from the data > > dump. > > > I > > > > assigned the documents to each label if any of the wikipedia category > > of > > > > the > > > > article has the country name in it. So a lot of data is pruned. The > > final > > > > Dataset is around 2.2GB > > > > > > > > Now here is the predicament. In Complementary NB classifier you > create > > a > > > > complement class for each label where the features of the complement > > > class > > > > are the features of all the other class. This means for all the > > 20Million > > > > odd words in Wikipedia a float value weight is there for each label. > > > > > > > > In my code I generate this in the 4th Map stage. for each word I > need > > to > > > > output N outputs (N is the number of labels) of the form > > > > <"label,feature", > > > > sum_of_weights of features>. This explodes the whole data in the > system > > > so > > > > after the Map stage I am left with 200M x 20 = 40Billion keyvalue > > pairs. > > > > This really slows things down. Took me over 2 hours and a lot of > > > > diskspace(over 26GB). Does anyone have any idea of doing this in an > > > > alternate way? One thing i am definitely doing is replacing all > labels > > > and > > > > features by integers. Please pour in optmisation ideas. I will submit > > > this > > > > patch soon so that everyone can check out. > > > > > > > > > > > > Robin > > > > > > > > > > > > > > > > -- > > > ted > > > > > > > > > > > -- > > Robin Anil > > Senior Dual Degree Student > > Department of Computer Science & Engineering > > IIT Kharagpur > > > > > > > -------------------------------------------------------------------------------------------- > > techdigger.wordpress.com > > A discursive take on the world around us > > > > www.minekey.com > > You Might Like This > > > > www.ithink.com > > Express Yourself > > > > > > -- > ted > -- Robin Anil Senior Dual Degree Student Department of Computer Science & Engineering IIT Kharagpur -------------------------------------------------------------------------------------------- techdigger.wordpress.com A discursive take on the world around us www.minekey.com You Might Like This www.ithink.com Express Yourself
