Hi, I have been experimenting with the Wikipedia data dump (17GB) and the CNB classifier. I used a list of the countries of the world (around 229 of them) as the labels and created a classification dataset from the dump. I assigned a document to a label if any of the article's Wikipedia categories has the country name in it, so a lot of data is pruned. The final dataset is around 2.2GB.
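To make the labelling rule concrete, here is a rough Python sketch of it. This is not the actual code: `assign_labels` and the sample category strings are made up for illustration, and case-insensitive substring matching is an assumption about how "has the country name in it" is checked.

```python
def assign_labels(categories, countries):
    """Return every country whose name occurs in at least one category string.

    Assumption: a case-insensitive substring test, so a document can
    receive several labels, or none (in which case it is pruned).
    """
    lowered = [category.lower() for category in categories]
    return [
        country
        for country in countries
        if any(country.lower() in category for category in lowered)
    ]


if __name__ == "__main__":
    countries = ["India", "France", "Niger", "Nigeria"]
    print(assign_labels(["Cities in India", "Rivers of France"], countries))
    # Caveat of substring matching: "Niger" also matches inside "Nigeria".
    print(assign_labels(["Cities in Nigeria"], countries))
```

An article whose categories match no country returns an empty list and is dropped, which is where the pruning from 17GB to 2.2GB comes from.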
Now here is the predicament. In the Complementary NB classifier you create a complement class for each label, where the features of the complement class are the features of all the other classes. This means that for each of the 20 million odd words in Wikipedia there is a float weight for every label. In my code I generate this in the 4th Map stage: for each word I need to emit N outputs (N is the number of labels) of the form <"label,feature", sum_of_weights of features>. This explodes the data in the system, so after the Map stage I am left with roughly 20M x 229, i.e. about 4.6 billion, key-value pairs. This really slows things down: it took over 2 hours and a lot of disk space (over 26GB).
Does anyone have an idea for doing this in an alternate way? One thing I am definitely doing is replacing all labels and features with integers. Please pour in optimisation ideas. I will submit this patch soon so that everyone can check it out.
Robin
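P.S. To show where the blow-up comes from, here is a minimal Python sketch of the per-word fan-out in that Map stage. It is not the actual code: `complement_map` and the toy weights are hypothetical, labels and features are already integer ids, and it assumes the complement weight for a label is the feature's total over all labels minus that label's own sum.

```python
def complement_map(feature_id, weights_by_label, num_labels):
    """Yield one ((label, feature), weight) pair per label.

    weights_by_label maps label id -> summed weight of this feature in
    that label's documents. The emitted weight is the sum over all the
    OTHER labels (assumed here to be total minus the label's own sum).
    """
    total = sum(weights_by_label.values())
    for label in range(num_labels):
        own = weights_by_label.get(label, 0.0)
        yield (label, feature_id), total - own


if __name__ == "__main__":
    # Toy case: feature 7 occurs under labels 0 and 2, with 3 labels in total.
    for key, weight in complement_map(7, {0: 1.5, 2: 0.5}, num_labels=3):
        print(key, weight)
    # Every word emits num_labels pairs, even for labels it never occurs in.
    print(20_000_000 * 229)  # 4580000000 pairs for 20M words and 229 labels
```

The point of the sketch is that a word seen under only two labels still produces one output per label, so the output size is words times labels regardless of how sparse the data is.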
