Hello! I'm new to Mahout, but I've been doing ML and Hadoop for a while now. I've got a fairly large dataset (about 1 billion records and ~200 columns) and a client interested in performing binary classification on the data. I've done some preliminary investigation with subsamples of the data in Weka, and Naive Bayes performs surprisingly well. My data has some features with large numbers of nominal values that would benefit from very large sample sizes when training. In the end I need to do something like the following:
* Read data from a tab-delimited file
* Discretize the numeric data (simple equal-interval binning will probably be fine; see the sketch in the P.S. below for what I mean)
* Build a Naive Bayes or similarly performing classifier on a relatively large training data set
* Evaluate against a test set of similarly structured data
* Generate ROC curves and similar evaluation metrics against the test set

I've gone through the Twenty Newsgroups examples and it appears that Mahout has some of the building blocks I need, but may be missing others. I'm comfortable writing all of these pieces from scratch, but I'd prefer to build this functionality into Mahout or a similar open source project, and I have my employer's support to do so. Which leads me to my questions:

* Does Mahout already have all the functionality I'm looking for and I just missed it?
* Would this be beneficial to, and in line with, Mahout?
* If this does make sense, where would you suggest I start?

Thanks in advance!

Jason R. Surratt
SPADAC
email: [email protected]
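P.S. To make the discretization step concrete, here is roughly the kind of equal-interval binning I have in mind. It's only a plain-Java sketch with made-up ranges and a made-up bin count, not tied to any existing Mahout API; in practice the per-column min/max would come from a pass over the training data (or a sample of it).

// Hypothetical sketch of equal-interval (equal-width) binning for one numeric column.
// The range and bin count below are illustrative only; nothing here is Mahout-specific.
public class EqualIntervalBinning {

    // Returns the bin index (0..numBins-1) for a value, given the column's observed min/max.
    static int bin(double value, double min, double max, int numBins) {
        if (max == min) {
            return 0; // degenerate column: every value lands in one bin
        }
        double width = (max - min) / numBins;
        int index = (int) ((value - min) / width);
        // Clamp so that value == max falls in the last bin instead of overflowing.
        return Math.min(index, numBins - 1);
    }

    public static void main(String[] args) {
        double[] column = {1.2, 3.7, 9.9, 5.0, 0.1};
        double min = 0.0, max = 10.0; // would normally come from a scan of the training data
        for (double v : column) {
            System.out.printf("%.1f -> bin %d%n", v, bin(v, min, max, 4));
        }
    }
}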
