Hello!

I'm new to Mahout, but I've been doing ML and Hadoop for a while now. I've got 
a fairly large dataset (about 1 billion records and ~200 columns) and a client 
interested in performing binary classification on the data. I've done some 
preliminary investigation with subsamples of the data in Weka, and Naive Bayes 
performs surprisingly well. Some of my features have a large number of distinct 
nominal values, so training would benefit from very large sample sizes. In the 
end I need to do something like the following:


* Read data from a tab-delimited file
* Discretize the numeric data (simple equal-interval binning will probably 
  be fine)
* Build a NB or similarly performing classifier on a relatively large 
  training data set
* Evaluate against a test set of similarly structured data
* Generate ROC curves and similar evaluation metrics against the test set
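
To make the steps above concrete, here is a minimal single-machine sketch in plain Python. It is only an illustration of the pipeline, not Mahout code: every function name is hypothetical, the label is assumed to be a 0/1 value in the last column, and the min/max range of each numeric column is assumed to be known in advance (at full scale that would take its own pass over the data).

```python
import csv
import io
import math
from collections import defaultdict


def equal_interval_bin(value, lo, hi, n_bins):
    """Map a numeric value to a bin index in [0, n_bins - 1]."""
    if hi <= lo:
        return 0
    idx = int((value - lo) / (hi - lo) * n_bins)
    return min(max(idx, 0), n_bins - 1)  # clamp out-of-range values


def read_tsv(text, numeric_cols, ranges, n_bins):
    """Parse tab-delimited rows; assumes the last column is the 0/1 label."""
    rows = []
    for rec in csv.reader(io.StringIO(text), delimiter="\t"):
        feats = []
        for i, v in enumerate(rec[:-1]):
            if i in numeric_cols:
                lo, hi = ranges[i]
                v = equal_interval_bin(float(v), lo, hi, n_bins)
            feats.append(v)
        rows.append((feats, int(rec[-1])))
    return rows


def train_nb(rows):
    """Count class and (class, column, value) frequencies."""
    class_counts = defaultdict(int)
    feat_counts = defaultdict(int)
    values = defaultdict(set)  # column -> distinct values seen in training
    for feats, y in rows:
        class_counts[y] += 1
        for i, v in enumerate(feats):
            feat_counts[(y, i, v)] += 1
            values[i].add(v)
    return class_counts, feat_counts, values


def score(model, feats):
    """Log-odds of class 1 versus class 0, with add-one smoothing."""
    class_counts, feat_counts, values = model
    total = sum(class_counts.values())
    logp = {}
    for y in (0, 1):
        n_y = class_counts.get(y, 0)
        lp = math.log((n_y + 1) / (total + 2))
        for i, v in enumerate(feats):
            # "+ 1" in the denominator leaves room for one unseen value
            denom = n_y + len(values.get(i, ())) + 1
            lp += math.log((feat_counts.get((y, i, v), 0) + 1) / denom)
        logp[y] = lp
    return logp[1] - logp[0]


def roc_points(scores, labels):
    """(false positive rate, true positive rate) pairs, one per threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    prev = None
    for s, y in sorted(zip(scores, labels), key=lambda p: -p[0]):
        if prev is not None and s != prev:
            points.append((fp / n_neg, tp / n_pos))
        if y == 1:
            tp += 1
        else:
            fp += 1
        prev = s
    points.append((fp / n_neg, tp / n_pos))
    return points
```

The counting in `train_nb` is the part that would map naturally onto a MapReduce job, since the per-class and per-value counts are simple sums; the binning and scoring steps are independent per record.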

I've gone through the Twenty Newsgroups examples and it appears that Mahout has 
some of the building blocks I need, but may be missing others. I'm comfortable 
writing all of these pieces from scratch, but I'd prefer to build this 
functionality into Mahout or a similar open source project and I have the 
support of my employer to do so.

Which leads me to my questions: Does Mahout already have all the functionality 
that I'm looking for, and I just missed it? If not, would contributing it be 
beneficial and in line with Mahout's goals? And if it does make sense, where 
would you suggest I start?

Thanks in advance!

Jason R. Surratt
SPADAC
email: [email protected]
