On Sep 11, 2011, at 10:19 AM, Grant Ingersoll wrote:

> A few classifier questions:
>
> What's the difference between the two naive bayes packages? AFAICT, the
> naivebayes package works off of vectors already, but are there any
> differences in the algorithms themselves? In other words, if I do
> seq2sparse to get vectors in, all should be good to go w/ the new
> vector-based naive bayes, right? Do we have docs on the new naivebayes
> package anywhere? For instance, how do the labels get associated with the
> training examples? I see the --labels option, but it isn't clear how it
> relates to the training data.
>
> As for SplitBayesInput, I don't see it being used anywhere, but I think I
> have a use case for it. The only thing is, I want it to work off of
> SequenceFiles and split them (b/c I want to run the new naivebayes
> package). Does that make sense?
>
> Here's what I'm ultimately trying to do:
> I've got all this ASF email data. It's currently bucketed like the
> newsgroups stuff, so I thought I would build a similar example (but one
> that actually makes sense to run on a cluster, given its size). I want to
> split the data into test and training sets across all the mailing lists,
> so that one could attempt to classify new mail by which project it belongs
> to (it will be interesting to see how dev lists compare to user lists).
> WIP is at github.com/lucidimagination/mahout.
>
Just to follow up, my current plan is:

1. Raw mail -> sequence files (SequenceFilesFromMailArchives)
2. seq2sparse
3. SplitBayesInput (which really should be renamed to just SplitInput, as
   there is nothing "Bayes" about it) -- also, make it work with sequence
   files
4. Run training
5. Run test
6. Conquer the world

> Given time, I'd also like to hook in some of the other classifiers, as I
> think it would be useful to have a single example, with real data, that
> runs all the various algorithms (clustering, classification, CF, etc.)
>
> -Grant
>
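For reference, the plan above maps roughly onto the following Mahout CLI invocations. This is only a sketch: the input/output paths are hypothetical, and the `split`/`trainnb`/`testnb` commands and flag names are the ones that ended up in later Mahout releases, so they may not match what's on trunk at the moment (in particular, step 3 assumes SplitBayesInput has already been renamed to `split` and taught to handle sequence files):

```shell
# 1. Raw mail archives -> SequenceFiles of (message id, body text)
bin/mahout org.apache.mahout.text.SequenceFilesFromMailArchives \
  --input /path/to/asf-mail-archives --output asf-mail-seq

# 2. SequenceFiles -> sparse TF-IDF vectors
bin/mahout seq2sparse --input asf-mail-seq --output asf-mail-vectors \
  --namedVector

# 3. Split the vectors into training and test sets
#    (assumes the SplitInput rename + sequence-file support)
bin/mahout split --input asf-mail-vectors/tfidf-vectors \
  --trainingOutput train-vectors --testOutput test-vectors \
  --randomSelectionPct 20 --overwrite --sequenceFiles -xm sequential

# 4. Train the vector-based naive bayes classifier; labels are
#    extracted from the sequence-file keys (one label per mailing list)
bin/mahout trainnb --input train-vectors --output nb-model \
  --labelIndex labelindex --extractLabels --overwrite

# 5. Test against the held-out vectors and print the confusion matrix
bin/mahout testnb --input test-vectors --model nb-model \
  --labelIndex labelindex --output test-results --overwrite
```

The key detail for the --labels question is in step 4: the new trainnb path derives the label for each training example from its sequence-file key, which is why the directory-per-mailing-list bucketing matters.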
