Robin,

Given that our "state of the art" classification currently is the SGD stuff, I'd target those interfaces (but not worry about the feature hashing). Why do you need new collocation stuff? Can this implementation not work on raw Vector instances? Let seq2sparse generate the vectors for you and use those. The tf-idf and tf vectors should both already exist, whether for unigrams or for ngrams (depending on how you called seq2sparse), right?
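For reference, the tf and tf-idf weighting baked into those vectors can be sketched in plain Python. This is an illustrative toy, not Mahout's seq2sparse code; the function names and the exact idf smoothing here are assumptions:

```python
import math
from collections import Counter

def tf_vectors(docs):
    """Raw term-frequency vectors: one Counter per tokenized document."""
    return [Counter(doc) for doc in docs]

def tfidf_vectors(docs):
    """Scale each term frequency by a (smoothed) inverse document frequency,
    so terms that appear in fewer documents get higher weight."""
    n = len(docs)
    df = Counter()                      # document frequency per term
    for doc in docs:
        df.update(set(doc))
    return [{t: tf * (math.log(n / df[t]) + 1.0) for t, tf in vec.items()}
            for vec in tf_vectors(docs)]

docs = [["good", "movie", "good"], ["bad", "movie"], ["good", "acting"]]
vecs = tfidf_vectors(docs)
```

The point of Jake's suggestion is that a classifier consuming such vectors never needs to see raw tokens at all; whether the dimensions are unigrams or ngrams is decided upstream.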
As for integrating with ConfusionMatrix and the other eval tools, I don't have much advice to offer...

On Thu, May 10, 2012 at 8:39 PM, Robin Anil <robin.a...@gmail.com> wrote:
> Any directions on what pattern I should follow for the redesign?
> ------
> Robin Anil
>
>
> On Wed, May 9, 2012 at 9:49 AM, Robin Anil <robin.a...@gmail.com> wrote:
>
> > I believe most of this new NB discussion has been over chat, so here is
> > the state of the NB universe from my view.
> >
> > 1) The original NB and CNB code worked as follows:
> >    - Tokenize and find all possible collocations
> >    - Compute tf and idf for each ngram
> >    - Compute global and per-class sums for tf, idf, and tf-idf
> >    - Dump these counts into SequenceFiles
> >    - Load these into memory or HBase and compute a score for each
> >      vector-label combination
> >
> > Issues:
> > A) It's slow. The collocation step, though efficient in its
> >    implementation (zero memory overhead, using secondary sort),
> >    explodes the learning time.
> > B) It's a memory hog. For really large models you need HBase to store
> >    the counts efficiently. The class has a cache for frequently used
> >    words in the language, so the classification overhead depends on the
> >    number of infrequent words in the document and the amount of
> >    parallel lookups you can do on an HBase cluster.
> >
> > The new NB and CNB code works as follows:
> >    - The redesigned Naive Bayes doesn't work over words. It assumes the
> >      input is a document vector and computes tf-idf and weights. (This
> >      is implemented.)
> >    - The per-class weight vectors are kept in memory and updated, so
> >      the limiting factor here is the number of classes * the number of
> >      dimensions. (This is implemented.)
> >    - If the vector space is limited using randomized hashing (Ted's
> >      technique), then you can limit the space. However, for (all
> >      possible) ngrams you would need a large dimension, which makes it
> >      unusable. (This is not done.)
> >    - So one needs to create collocation vectors smartly. (This is not
> >      done.)
> >    - The implementation as of now learns the model, has model
> >      serialization and deserialization methods, and an interface for
> >      classifying using the loaded model. (This is implemented.)
> >
> > Issues:
> > A) It lacks train and test driver code; it just has the core
> >    implementation.
> > B) It is not integrated with the evaluation classes (ConfusionMatrix,
> >    per-label precision/recall).
> > C) We need to port the collocations driver to generate collocations
> >    and convert documents to vectors.
> > D) The multilabel classifier is not using any common interface like
> >    the logistic regression package.
> >
> > When I checked in the code I didn't have time to pursue this. If
> > someone can recommend the right approach to fixing this package (the
> > right interface to use, how it should behave with the rest of the
> > code), it becomes easier for me to jump back on moulding the current
> > implementation.
> >
> > ------
> > Robin Anil
> >
> >
> > On Wed, May 9, 2012 at 5:48 AM, Grant Ingersoll <gsing...@apache.org> wrote:
> >
> >> On May 8, 2012, at 12:43 PM, Jake Mannix wrote:
> >>
> >> > On Tue, May 8, 2012 at 9:31 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> >> >
> >> >> It is frustrating to consider losing Bayes, but I would consider
> >> >> keeping it, if only to decrease the number of questions on the
> >> >> list about why the examples from the book don't work.
> >> >>
> >> >
> >> > Could maybe someone just sit down and rewrite it? Naive Bayes is
> >> > not a particularly difficult thing to implement, even distributed
> >> > (it's like word-count, basically. OK, maybe it's more like counting
> >> > collocations, but still!).
> >> >
> >> > It would be pretty silly not to have an NB impl (although I agree
> >> > that it's even worse to have a broken or clunky one).
> >>
> >> I agree.
> >> The vector-based one is a rewrite, so we should probably just go from
> >> there. I'm not sure it is broken, but Robin is the primary person
> >> familiar with it, and in the past I've pinged the list about the
> >> state of it (and tried to get explanations of certain parts of it)
> >> without getting answers.
> >
> >> With all of these Hadoop algorithms, the other thing we really need
> >> is to make them programmatically easier to integrate. The Driver mode
> >> is not too bad for testing, etc., but it makes them harder to
> >> integrate, as others have pointed out.

--
  -jake
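To make the vector-based design Robin describes above concrete (training just accumulates per-class weight vectors from tf-idf document vectors, so memory is bounded by classes * dimensions), here is a minimal standalone sketch. It is plain Python with made-up names, using a standard multinomial NB with Laplace smoothing; it is not the actual Mahout implementation, and the smoothing parameters are assumptions:

```python
import math
from collections import defaultdict

class VectorNaiveBayes:
    """Sketch of the vector-based design: training sums each (tf-idf)
    document vector into a per-class weight vector, so memory usage is
    O(num_classes * num_dimensions) regardless of corpus size."""

    def __init__(self):
        self.weights = defaultdict(lambda: defaultdict(float))  # label -> term -> weight
        self.totals = defaultdict(float)                        # label -> total weight

    def train(self, label, vector):
        # One pass per document: pure accumulation, trivially map-reducible.
        for term, w in vector.items():
            self.weights[label][term] += w
            self.totals[label] += w

    def classify(self, vector, alpha=1.0, vocab_size=1000):
        # Multinomial NB log-likelihood with Laplace smoothing (alpha).
        best_label, best_score = None, -math.inf
        for label, wts in self.weights.items():
            total = self.totals[label]
            score = 0.0
            for term, w in vector.items():
                p = (wts.get(term, 0.0) + alpha) / (total + alpha * vocab_size)
                score += w * math.log(p)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

nb = VectorNaiveBayes()
nb.train("pos", {"good": 2.0, "movie": 1.0})
nb.train("neg", {"bad": 2.0, "movie": 1.0})
```

Note the classifier never sees tokens, only vector dimensions, which is why the collocation/ngram question lives entirely in the vectorization step and why hashing the dimensions (Ted's technique) is purely an upstream concern.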