Re: CNB: Learning from Huge Datasets

Ted Dunning Fri, 11 Jul 2008 16:39:01 -0700

Regardless of the arithmetic details, it still sounds like combiners will
have substantial impact.


You may have 40 B *potential* uniques, but I doubt you have that many actual
uniques.  The combiner will knock down the data size by average count.  If
you want to get medieval, you can even have the combiner not emit any
results that have less than 2 counts.  This is a very approximate pruning
and should be avoided because the result depends on degree of parallelism,
but it may help your cause substantially if nothing else will.

On Fri, Jul 11, 2008 at 4:24 PM, Robin Anil <[EMAIL PROTECTED]> wrote:

> On Fri, Jul 11, 2008 at 1:58 PM, Ted Dunning <[EMAIL PROTECTED]>
> wrote:
>
> > It sounds to me like there is scope here for combiners, especially on the
> > final stage.  If they can be applied to earlier stages as well, you might
> > be
> > able to collapse some of the data nicely.  If the number of unique words
> in
> > the corpus is a million, then a combiner might be able to improve the
> > number
> > of items in the intermediate map output of your last stage by up to two
> > orders of magnitude.
> >
> > Here my key is a label,feature pair so the number of uniques = 40B
>
>
> >
> >
> > Also, by my calculation, 20 x 200M = 4 x 10^9 (not 40 x 10^9).  Still
> > large,
> > but not vast.
>
>
> :) That was ~200(Countries)  x 200M. The result was alright
>
>
> >
> > On Thu, Jul 10, 2008 at 11:15 PM, Robin Anil <[EMAIL PROTECTED]>
> wrote:
> >
> > > Hi,
> > >   I had been experimenting with Wikipedia datadump(17GB) with the CNB
> > > classifier. I used a list of countries of the world(around 229 of them)
> > as
> > > the labels and then created a classification dataset from the data
> dump.
> >  I
> > > assigned the documents to each label if any of the wikipedia category
> of
> > > the
> > > article has the country name in it. So a lot of data is pruned. The
> final
> > > Dataset is around 2.2GB
> > >
> > > Now here is the predicament. In Complementary NB classifier you create
> a
> > > complement class for each label where the features of the complement
> > class
> > > are the features of all the other class. This means for all the
> 20Million
> > > odd words in Wikipedia a float value weight is there for each label.
> > >
> > > In my code I generate this in the 4th Map stage.  for each word I need
> to
> > > output N  outputs  (N is the number of labels) of the form
> > > <"label,feature",
> > > sum_of_weights of features>. This explodes the whole data in the system
> > so
> > > after the Map stage I am left with 200M x 20 = 40Billion keyvalue
> pairs.
> > > This really slows things down. Took me over 2 hours and a lot of
> > > diskspace(over 26GB).  Does anyone have any idea of doing this in an
> > > alternate way? One thing i am definitely doing is replacing all labels
> > and
> > > features by integers. Please pour in optmisation ideas. I will submit
> > this
> > > patch soon so that everyone can check out.
> > >
> > >
> > > Robin
> > >
> >
> >
> >
> > --
> > ted
> >
>
>
>
> --
> Robin Anil
> Senior Dual Degree Student
> Department of Computer Science & Engineering
> IIT Kharagpur
>
>
> --------------------------------------------------------------------------------------------
> techdigger.wordpress.com
> A discursive take on the world around us
>
> www.minekey.com
> You Might Like This
>
> www.ithink.com
> Express Yourself
>



-- 
ted

Re: CNB: Learning from Huge Datasets

Reply via email to