This sounds great. I suggest testing the naive Bayes, complementary naive Bayes, SVM, and SGD implementations. Given that naive Bayes has worked well on a sample, you will probably be very happy with SVM and SGD, since they handle features with very large cardinality well.
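As a rough illustration of why SGD copes with high-cardinality columns, here is a minimal Python sketch. It is not Mahout's API and not the code in any of the patches; the column names, toy records, vector size, and learning rate are all made up for the example. It hashes each column=value pair into a fixed-size sparse vector, then trains a logistic regression one example at a time.

```python
import math
import zlib

NUM_FEATURES = 1 << 10  # size of the hashed feature space (arbitrary for this sketch)


def encode(record):
    """Hash each column=value pair to an index in a fixed-size sparse vector."""
    vec = {}
    for column, value in record.items():
        # crc32 is deterministic across runs, unlike Python's built-in hash()
        idx = zlib.crc32(f"{column}={value}".encode("utf-8")) % NUM_FEATURES
        vec[idx] = vec.get(idx, 0.0) + 1.0
    return vec


def predict(weights, vec):
    """Return the logistic-regression probability of the positive class."""
    score = sum(weights[i] * x for i, x in vec.items())
    return 1.0 / (1.0 + math.exp(-score))


def train_step(weights, vec, label, rate=0.1):
    """One SGD update on a single example (label is 0 or 1)."""
    error = label - predict(weights, vec)
    for i, x in vec.items():
        weights[i] += rate * error * x


# Hypothetical toy records; the column names and values are invented.
data = [
    ({"color": "red", "shape": "round"}, 1),
    ({"color": "green", "shape": "square"}, 0),
    ({"color": "red", "shape": "square"}, 1),
    ({"color": "green", "shape": "round"}, 0),
]

weights = [0.0] * NUM_FEATURES
for _ in range(200):  # several passes over the toy data
    for record, label in data:
        train_step(weights, encode(record), label)
```

The weight vector stays at NUM_FEATURES entries no matter how many distinct values a column takes, and each update touches only the handful of indices present in the current record. The price is that unrelated values can occasionally hash to the same index.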
You will need to vectorize your input. Since you have many columns, you may want to look at Drew's document-style work. See https://issues.apache.org/jira/browse/MAHOUT-274

There are the beginnings of some vectorization of the sort you will need in the SGD patch: http://issues.apache.org/jira/browse/MAHOUT-228

That patch also has a learning system that will build your classifier using on-line logistic regression.

The SVM implementation is at http://issues.apache.org/jira/browse/MAHOUT-232

The NB and CNB implementations are already in Mahout itself.

On Wed, Feb 17, 2010 at 1:58 PM, Jason Surratt <[email protected]> wrote:

> Which leads me to my questions: Does Mahout already have all the
> functionality that I'm looking for and I just missed it? Would this be
> beneficial and in line with Mahout? If this does make sense, where would you
> suggest I start?
>

--
Ted Dunning, CTO
DeepDyve
