On Sep 11, 2011, at 10:19 AM, Grant Ingersoll wrote:

> A few classifier questions:
>
> What's the difference between the two naive bayes packages? AFAICT, the
> naivebayes package works off of vectors already, but are there any
> differences in the algorithms themselves? In other words, if I do
> seq2sparse to get vectors in, all should be good to go w/ the new
> vector-based naive bayes, right? Do we have docs on the new naivebayes
> package anywhere? For instance, how do the labels get associated with the
> training examples? I see the --labels option, but it isn't clear how it
> relates to the training data.
>
> As for SplitBayesInput, I don't see it being used anywhere, but I think I
> have a use case for it. The only thing is, I want it to work off of
> SequenceFiles and split them (b/c I want to run the new naivebayes
> package). Does that make sense?
>
> Here's what I'm ultimately trying to do:
> I've got all this ASF email data. It's currently bucketed like the
> newsgroups stuff, so I thought I would build a similar example (but one
> that actually makes sense to run on a cluster, given its size). I want to
> split the data into test and training sets across all the mailing lists,
> so that one could attempt to classify new mail by which project it belongs
> to (it will be interesting to see how dev lists compare to user lists).
> WIP is at github.com/lucidimagination/mahout.
>
Just to follow up, my current plan is:

1. Raw mail -> sequence files (SequenceFilesFromMailArchives)
2. seq2sparse
3. SplitBayesInput (which really should be renamed to just SplitInput, as
   there is nothing "Bayes" about it) -- also, make it work with sequence
   files
4. Run training
5. Run test
6. Conquer the world

> Given time, I'd also like to hook in some of the other classifiers, as I
> think it would be useful to have a single example, with real data, that
> runs all the various algorithms (clustering, classification, CF, etc.)
>
> -Grant
>
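For reference, the plan above maps roughly onto the following Mahout CLI invocations. This is only a sketch: the input/output paths are hypothetical, and the `split`/`trainnb`/`testnb` commands and flag names are the ones that ended up in later Mahout releases, so they may not match what's on trunk at the moment (in particular, step 3 assumes SplitBayesInput has already been renamed to `split` and taught to handle sequence files):

```shell
# 1. Raw mail archives -> SequenceFiles of (message id, body text)
bin/mahout org.apache.mahout.text.SequenceFilesFromMailArchives \
  --input /path/to/asf-mail-archives --output asf-mail-seq

# 2. SequenceFiles -> sparse TF-IDF vectors
bin/mahout seq2sparse --input asf-mail-seq --output asf-mail-vectors \
  --namedVector

# 3. Split the vectors into training and test sets
#    (assumes the SplitInput rename + sequence-file support)
bin/mahout split --input asf-mail-vectors/tfidf-vectors \
  --trainingOutput train-vectors --testOutput test-vectors \
  --randomSelectionPct 20 --overwrite --sequenceFiles -xm sequential

# 4. Train the vector-based naive bayes classifier; labels are
#    extracted from the sequence-file keys (one label per mailing list)
bin/mahout trainnb --input train-vectors --output nb-model \
  --labelIndex labelindex --extractLabels --overwrite

# 5. Test against the held-out vectors and print the confusion matrix
bin/mahout testnb --input test-vectors --model nb-model \
  --labelIndex labelindex --output test-results --overwrite
```

The key detail for the --labels question is in step 4: the new trainnb path derives the label for each training example from its sequence-file key, which is why the directory-per-mailing-list bucketing matters.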
