On Fri, Jul 11, 2008 at 1:58 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
> It sounds to me like there is scope here for combiners, especially on the > final stage. If they can be applied to earlier stages as well, you might > be > able to collapse some of the data nicely. If the number of unique words in > the corpus is a million, then a combiner might be able to improve the > number > of items in the intermediate map output of your last stage by up to two > orders of magnitude. > > Here my key is a label,feature pair so the number of uniques = 40B > > > Also, by my calculation, 20 x 200M = 4 x 10^9 (not 40 x 10^9). Still > large, > but not vast. :) That was ~200(Countries) x 200M. The result was alright > > On Thu, Jul 10, 2008 at 11:15 PM, Robin Anil <[EMAIL PROTECTED]> wrote: > > > Hi, > > I had been experimenting with Wikipedia datadump(17GB) with the CNB > > classifier. I used a list of countries of the world(around 229 of them) > as > > the labels and then created a classification dataset from the data dump. > I > > assigned the documents to each label if any of the wikipedia category of > > the > > article has the country name in it. So a lot of data is pruned. The final > > Dataset is around 2.2GB > > > > Now here is the predicament. In Complementary NB classifier you create a > > complement class for each label where the features of the complement > class > > are the features of all the other class. This means for all the 20Million > > odd words in Wikipedia a float value weight is there for each label. > > > > In my code I generate this in the 4th Map stage. for each word I need to > > output N outputs (N is the number of labels) of the form > > <"label,feature", > > sum_of_weights of features>. This explodes the whole data in the system > so > > after the Map stage I am left with 200M x 20 = 40Billion keyvalue pairs. > > This really slows things down. Took me over 2 hours and a lot of > > diskspace(over 26GB). Does anyone have any idea of doing this in an > > alternate way? One thing i am definitely doing is replacing all labels > and > > features by integers. Please pour in optmisation ideas. I will submit > this > > patch soon so that everyone can check out. > > > > > > Robin > > > > > > -- > ted > -- Robin Anil Senior Dual Degree Student Department of Computer Science & Engineering IIT Kharagpur -------------------------------------------------------------------------------------------- techdigger.wordpress.com A discursive take on the world around us www.minekey.com You Might Like This www.ithink.com Express Yourself
