A few classifier questions: What's the difference between the two naive bayes packages? AFAICT, the naivebayes package works off of vectors already, but are there any differences in the algorithms themselves? In other words, if I do seq2sparse to get vectors in, all should be good to go w/ the new vector-based naive bayes, right? Do we have docs on the new naivebayes package anywhere? For instance, how do the labels get associated with the training examples? I see the --labels option, but it isn't clear how it relates to the training data.
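To make the label question concrete: what I'm guessing (and this is an assumption on my part, not something I've confirmed in the code) is that the label rides along in the document key, e.g. keys of the form "/category/doc-id" as produced by a directory-per-category corpus. A plain-Python sketch of that idea:

```python
# Sketch (an assumption, not the actual Mahout implementation): if training
# documents are stored as (key, text) pairs with keys like "/label/doc-id",
# the label for each example can be recovered from the key itself.

def label_from_key(key):
    """Return the category portion of a "/label/doc-id" style key."""
    return key.strip("/").split("/")[0]

# Hypothetical examples in that key layout:
examples = [
    ("/tomcat-user/msg001", "how do I configure the connector ..."),
    ("/lucene-dev/msg042", "patch attached for the new query parser ..."),
]

labeled = [(label_from_key(key), text) for key, text in examples]
```

If that's roughly right, then --labels would just be declaring the set of valid categories rather than doing the association itself, but I'd love confirmation.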
As for SplitBayesInput, I don't see it being used anywhere, but I think I have a case for it. The only thing is, I want it to work off of SequenceFiles and split them, I think (b/c I want to run the new naivebayes package). Does that make sense?

Here's what I'm ultimately trying to do: I've got all this ASF email data. It's currently bucketed like the newsgroups stuff, so I thought I would build a similar example (but one that actually makes sense to run on a cluster due to size). I want to split the data into test and training sets across all the mailing lists such that one could attempt to classify new mail as to which project it belongs to (it will be interesting to see how dev lists compare vs. user lists). WIP is at github.com/lucidimagination/mahout.

Given time, I'd also like to hook in some of the various other classifiers, as I think it would be useful to have a single example, with real data, that runs all the various algorithms (clustering, classification, CF, etc.).

-Grant
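P.S. The per-list split I have in mind is roughly the following (a plain-Python sketch, not SplitBayesInput itself; the function name and data shapes are just illustrative): hold out a fixed fraction of each mailing list for testing, so every list shows up in both the training and the test set.

```python
import random

# Hypothetical sketch of a stratified train/test split: hold out a fixed
# fraction of each category (mailing list) for testing, so every list is
# represented in both sets. This is NOT Mahout's SplitBayesInput, just the
# idea behind the split I want.

def split_by_category(examples, test_fraction=0.2, seed=42):
    """examples: iterable of (label, doc) pairs. Returns (train, test)."""
    rng = random.Random(seed)
    by_label = {}
    for label, doc in examples:
        by_label.setdefault(label, []).append((label, doc))
    train, test = [], []
    for docs in by_label.values():
        rng.shuffle(docs)
        cut = int(len(docs) * test_fraction)
        test.extend(docs[:cut])
        train.extend(docs[cut:])
    return train, test

# Toy data: two "mailing lists" with ten messages each.
examples = [("list-a", f"doc{i}") for i in range(10)] + \
           [("list-b", f"doc{i}") for i in range(10)]
train, test = split_by_category(examples)
```

The real version would of course read and write SequenceFiles rather than in-memory lists, which is exactly why I'm asking about SplitBayesInput.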
