Small change in your world domination plan

On Sun, Sep 11, 2011 at 9:09 PM, Grant Ingersoll <[email protected]> wrote:
>
> On Sep 11, 2011, at 10:19 AM, Grant Ingersoll wrote:
>
> > A few classifier questions:
> >
> > What's the difference between the two naive bayes packages? AFAICT, the
> > naivebayes works off of vectors already, but are there any differences
> > in the algorithms themselves? In other words, if I do seq2sparse to get
> > vectors in, all should be good to go w/ the new vector based naive
> > bayes, right? Do we have docs on the new naivebayes package anywhere?
> > For instance, how do the labels get associated with the training
> > examples? I see the --labels option, but it isn't clear how it relates
> > to the training data.
> >
> > As for SplitBayesInput, I don't see that being used anywhere, but I
> > think I have a case for it. The only thing is, I want it to work off of
> > SequenceFiles and split them, I think (b/c I want to run the new
> > naivebayes package) Does that make sense?
> >
> > Here's what I'm ultimately trying to do:
> > I've got all this ASF email data. It's currently bucketed like the news
> > groups stuff, so I thought I would build a similar example (but one
> > that actually makes sense to run in a cluster due to size). I want to
> > take and split the data into test and training sets across all the
> > mailing lists such that one could attempt to classify new mail as to
> > which project it belongs to (it will be curious to see how it compares
> > dev lists vs. user lists.) WIP is at github.com/lucidimagination/mahout.
> >
>
> Just to follow up, my current plan would be to do:
> 1. Raw mail -> sequence files (SequenceFilesFromMailArchives)
> 2. seq2sparse
> 3. SplitBayesInput (which really should be renamed to just SplitInput, as
>    there is nothing "Bayes" about it) -- also, make it work with Sequence
>    files
> 4. Run training
> 5. Run test (need to load up the class vectors and compute dot products)
> 6. Conquer the world
>
> > Given time, I'd also like to hook in some of the various other
> > classifiers, as I think it would be useful to be able to have a single
> > example, with real data, that runs all the various algorithms
> > (clustering, classification, CF, etc.)
> >
> > -Grant
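
For what it's worth, steps 3 and 5 of the plan above can be sketched roughly as follows. This is only an illustration in plain Python, not Mahout code: the function names (`split_by_label`, `sparse_dot`, `classify`), the dict-based sparse vectors, and the 20% test fraction are all made up for the example, and it assumes training has already produced one weight vector per class.

```python
import random
from collections import defaultdict


def split_by_label(examples, test_fraction=0.2, seed=42):
    """Split (label, vector) pairs into train and test sets, label by label.

    Hypothetical stand-in for a SplitInput-style step; splitting within each
    label keeps every mailing list represented in both sets.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for label, vector in examples:
        by_label[label].append(vector)

    train, test = [], []
    for label in sorted(by_label):
        vectors = by_label[label]
        rng.shuffle(vectors)
        n_test = int(len(vectors) * test_fraction)
        test.extend((label, v) for v in vectors[:n_test])
        train.extend((label, v) for v in vectors[n_test:])
    return train, test


def sparse_dot(a, b):
    """Dot product of two sparse vectors stored as {term_index: weight}."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    return sum(w * b.get(i, 0.0) for i, w in a.items())


def classify(class_vectors, doc_vector):
    """Return the label whose class vector scores highest against the document."""
    return max(class_vectors,
               key=lambda label: sparse_dot(class_vectors[label], doc_vector))
```

Usage would look something like `train, test = split_by_label(examples)` followed by `classify(class_vectors, vec)` for each `(label, vec)` in `test`, counting how often the returned label matches.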
