Hi,
   I have been experimenting with the Wikipedia data dump (17GB) and the CNB
classifier. I used a list of the countries of the world (around 229 of them) as
the labels and built a classification dataset from the dump. I assigned an
article to a label if any of its Wikipedia categories contains that country's
name, so a lot of the data is pruned. The final dataset is around 2.2GB.

Now here is the predicament. In the Complementary NB classifier you create a
complement class for each label, whose features are the features of all the
other classes. This means that for each of the 20 million odd words in
Wikipedia there is a float weight for every label.
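
To make the size of that table concrete, here is a minimal Python sketch of
the definition above on toy data. The function name, the labels and the
weights are made up for illustration; this is not the actual MapReduce code.

```python
def complement_table(label_feature_sums):
    """Build the dense complement table straight from the definition.

    label_feature_sums: {label: {feature: summed weight}} (toy, in-memory).
    Returns {(label, feature): sum of that feature's weight in all OTHER labels}.
    """
    features = sorted({f for feats in label_feature_sums.values() for f in feats})
    table = {}
    for label in label_feature_sums:
        for f in features:
            # Complement of `label`: add up the feature over every other label.
            table[(label, f)] = sum(
                feats.get(f, 0.0)
                for other, feats in label_feature_sums.items()
                if other != label
            )
    return table


# Hypothetical toy input: 3 labels, 3 distinct features.
sums = {
    "india": {"delhi": 3.0, "river": 1.0},
    "france": {"paris": 2.0, "river": 4.0},
    "brazil": {"river": 5.0},
}
table = complement_table(sums)

# One entry per (label, feature) pair, even where the label never saw the word.
dense_entries_toy = len(table)

# Same count using the figures from this post: ~20M words, ~229 labels.
dense_entries_wikipedia = 20_000_000 * 229
```

Every label gets an entry for every word, whether or not the word ever
appears under that label, which is where the blow-up comes from.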

In my code I generate this in the 4th Map stage. For each word I need to
emit N outputs (N is the number of labels) of the form <"label,feature",
sum_of_weights of features>. This explodes the data in the system, so
after the Map stage I am left with roughly 20M x 200 = 4 billion key-value pairs.
This really slows things down: it took over 2 hours and a lot of
disk space (over 26GB). Does anyone have an idea for doing this in an
alternate way? One thing I am definitely doing is replacing all labels and
features with integers. Please pour in optimisation ideas. I will submit this
patch soon so that everyone can check it out.
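
One direction that might be worth trying, sketched in Python on toy data:
keep only the sparse per-label sums plus one total per feature, and derive a
complement weight on demand as "total minus the label's own sum". This is an
untested idea, not what the current code does; all names and numbers below
are hypothetical.

```python
from collections import defaultdict


def build_sparse_model(label_feature_sums):
    """Return (own, totals).

    own:    {label: {feature: summed weight}}, only pairs that actually occur.
    totals: {feature: summed weight over all labels}, one entry per feature.
    """
    totals = defaultdict(float)
    for feats in label_feature_sums.values():
        for f, w in feats.items():
            totals[f] += w
    return label_feature_sums, dict(totals)


def complement_weight(own, totals, label, feature):
    # Weight of `feature` in every label except `label`,
    # computed at lookup time instead of being materialised.
    return totals.get(feature, 0.0) - own.get(label, {}).get(feature, 0.0)


# Hypothetical toy input: 3 labels, 3 distinct features.
sums = {
    "india": {"delhi": 3.0, "river": 1.0},
    "france": {"paris": 2.0, "river": 4.0},
    "brazil": {"river": 5.0},
}
own, totals = build_sparse_model(sums)

# Stored entries: observed (label, feature) pairs plus one total per feature.
stored_entries = sum(len(feats) for feats in own.values()) + len(totals)
```

The stored size then grows with the number of (label, word) pairs that really
occur plus the vocabulary size, rather than with labels x vocabulary. Whether
this fits the existing Map stages is something I have not checked.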


Robin
