Robin,

Given that our "state of the art" classification currently is the SGD stuff, I'd target those interfaces (but not worry about the feature hashing). Why do you need new collocation stuff? Can this implementation not work on raw Vector instances? Let seq2sparse generate the vectors for you and use those. The tf-idf and tf vectors should both already exist, whether for unigrams or for ngrams (depending on how you called seq2sparse), right?
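For reference, the tf and tf-idf weighting baked into those vectors can be sketched in plain Python. This is an illustrative toy, not Mahout's seq2sparse code; the function names and the exact idf smoothing here are assumptions:

```python
import math
from collections import Counter

def tf_vectors(docs):
    """Raw term-frequency vectors: one Counter per tokenized document."""
    return [Counter(doc) for doc in docs]

def tfidf_vectors(docs):
    """Scale each term frequency by a (smoothed) inverse document frequency,
    so terms that appear in fewer documents get higher weight."""
    n = len(docs)
    df = Counter()                      # document frequency per term
    for doc in docs:
        df.update(set(doc))
    return [{t: tf * (math.log(n / df[t]) + 1.0) for t, tf in vec.items()}
            for vec in tf_vectors(docs)]

docs = [["good", "movie", "good"], ["bad", "movie"], ["good", "acting"]]
vecs = tfidf_vectors(docs)
```

The point of Jake's suggestion is that a classifier consuming such vectors never needs to see raw tokens at all; whether the dimensions are unigrams or ngrams is decided upstream.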
As for integrating with ConfusionMatrix and the other eval tools, I don't have much advice to offer...

On Thu, May 10, 2012 at 8:39 PM, Robin Anil <robin.a...@gmail.com> wrote:
> Any directions on what pattern I should follow for the redesign?
> ------
> Robin Anil
>
>
> On Wed, May 9, 2012 at 9:49 AM, Robin Anil <robin.a...@gmail.com> wrote:
>
> > I believe most of this new NB discussion has been over chat, so here is
> > the state of the NB universe from my view.
> >
> > 1) The original NB and CNB code worked as follows:
> >    - Tokenize and find all possible collocations
> >    - Compute tf and idf for each ngram
> >    - Compute global and per-class sums for tf, idf, and tf-idf
> >    - Dump these counts into SequenceFiles
> >    - Load these into memory or HBase and compute a score for each
> >      vector-label combination
> >
> > Issues:
> > A) It's slow. The collocation step, though efficient in its
> >    implementation (zero memory overhead, using secondary sort),
> >    explodes the learning time.
> > B) It's a memory hog. For really large models you need HBase to store
> >    the counts efficiently. The class has a cache for frequently used
> >    words in the language, so the classification overhead depends on the
> >    number of infrequent words in the document and the amount of
> >    parallel lookups you can do on an HBase cluster.
> >
> > The new NB and CNB code works as follows:
> >    - The redesigned Naive Bayes doesn't work over words. It assumes the
> >      input is a document vector and computes tf-idf and weights. (This
> >      is implemented.)
> >    - The per-class weight vectors are kept in memory and updated, so
> >      the limiting factor here is the number of classes * the number of
> >      dimensions. (This is implemented.)
> >    - If the vector space is limited using randomized hashing (Ted's
> >      technique), then you can limit the space. However, for (all
> >      possible) ngrams you would need a large dimension, which makes it
> >      unusable. (This is not done.)
> >    - So one needs to create collocation vectors smartly. (This is not
> >      done.)
> >    - The implementation as of now learns the model, has model
> >      serialization and deserialization methods, and an interface for
> >      classifying using the loaded model. (This is implemented.)
> >
> > Issues:
> > A) It lacks train and test driver code; it just has the core
> >    implementation.
> > B) It is not integrated with the evaluation classes (ConfusionMatrix,
> >    per-label precision/recall).
> > C) We need to port the collocations driver to generate collocations
> >    and convert documents to vectors.
> > D) The multilabel classifier is not using any common interface like
> >    the logistic regression package.
> >
> > When I checked in the code I didn't have time to pursue this. If
> > someone can recommend the right approach to fixing this package (the
> > right interface to use, how it should behave with the rest of the
> > code), it becomes easier for me to jump back on moulding the current
> > implementation.
> >
> > ------
> > Robin Anil
> >
> >
> > On Wed, May 9, 2012 at 5:48 AM, Grant Ingersoll <gsing...@apache.org> wrote:
> >
> >> On May 8, 2012, at 12:43 PM, Jake Mannix wrote:
> >>
> >> > On Tue, May 8, 2012 at 9:31 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> >> >
> >> >> It is frustrating to consider losing Bayes, but I would consider
> >> >> keeping it, if only to decrease the number of questions on the
> >> >> list about why the examples from the book don't work.
> >> >>
> >> >
> >> > Could maybe someone just sit down and rewrite it? Naive Bayes is
> >> > not a particularly difficult thing to implement, even distributed
> >> > (it's like word-count, basically. OK, maybe it's more like counting
> >> > collocations, but still!).
> >> >
> >> > It would be pretty silly not to have an NB impl (although I agree
> >> > that it's even worse to have a broken or clunky one).
> >>
> >> I agree.
> >> The vector-based one is a rewrite, so we should probably just go from
> >> there. I'm not sure it is broken, but Robin is the primary person
> >> familiar with it, and in the past I've pinged the list about the
> >> state of it (and tried to get explanations of certain parts of it)
> >> without getting answers.
> >
> >> With all of these Hadoop algorithms, the other thing we really need
> >> is to make them programmatically easier to integrate. The Driver mode
> >> is not too bad for testing, etc., but it makes them harder to
> >> integrate, as others have pointed out.

--
  -jake
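To make the vector-based design Robin describes above concrete (training just accumulates per-class weight vectors from tf-idf document vectors, so memory is bounded by classes * dimensions), here is a minimal standalone sketch. It is plain Python with made-up names, using a standard multinomial NB with Laplace smoothing; it is not the actual Mahout implementation, and the smoothing parameters are assumptions:

```python
import math
from collections import defaultdict

class VectorNaiveBayes:
    """Sketch of the vector-based design: training sums each (tf-idf)
    document vector into a per-class weight vector, so memory usage is
    O(num_classes * num_dimensions) regardless of corpus size."""

    def __init__(self):
        self.weights = defaultdict(lambda: defaultdict(float))  # label -> term -> weight
        self.totals = defaultdict(float)                        # label -> total weight

    def train(self, label, vector):
        # One pass per document: pure accumulation, trivially map-reducible.
        for term, w in vector.items():
            self.weights[label][term] += w
            self.totals[label] += w

    def classify(self, vector, alpha=1.0, vocab_size=1000):
        # Multinomial NB log-likelihood with Laplace smoothing (alpha).
        best_label, best_score = None, -math.inf
        for label, wts in self.weights.items():
            total = self.totals[label]
            score = 0.0
            for term, w in vector.items():
                p = (wts.get(term, 0.0) + alpha) / (total + alpha * vocab_size)
                score += w * math.log(p)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

nb = VectorNaiveBayes()
nb.train("pos", {"good": 2.0, "movie": 1.0})
nb.train("neg", {"bad": 2.0, "movie": 1.0})
```

Note the classifier never sees tokens, only vector dimensions, which is why the collocation/ngram question lives entirely in the vectorization step and why hashing the dimensions (Ted's technique) is purely an upstream concern.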