A few classifier questions:

What's the difference between the two naive bayes packages?  AFAICT, the new 
naivebayes package works off of vectors already, but are there any differences 
in the algorithms themselves?  In other words, if I run seq2sparse to get 
vectors in, I should be good to go with the new vector-based naive bayes, 
right?  Do we have docs on the new naivebayes package anywhere?  For instance, 
how do the labels get associated with the training examples?  I see the 
--labels option, but it isn't clear how it relates to the training data.
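
To make the label question concrete, here's the kind of sanity check I was 
planning to run against the seq2sparse output: dump the SequenceFile keys and 
see whether the label is encoded in them.  I'm guessing a "/label/docId" style 
key, the way seqdirectory builds keys from directory paths, but that's just a 
guess on my part, not something I've confirmed in the new trainer:

import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class InspectKeys {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // args[0]: a seq2sparse output part file, e.g. under tfidf-vectors/
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(args[0]), conf);
    Map<String, Integer> labelCounts = new HashMap<String, Integer>();
    try {
      Text key = new Text();           // assuming Text keys, as seqdirectory produces
      while (reader.next(key)) {       // only the keys are needed here
        String k = key.toString();
        // Guess: the label is the first path segment of a "/label/docId" style key.
        String label = k.startsWith("/") ? k.substring(1).split("/")[0] : k;
        Integer count = labelCounts.get(label);
        labelCounts.put(label, count == null ? 1 : count + 1);
      }
    } finally {
      reader.close();
    }
    System.out.println(labelCounts);
  }
}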

As for SplitBayesInput, I don't see it being used anywhere, but I think I have 
a use case for it.  The only thing is, I want it to work off of SequenceFiles 
and split them (because I want to run the new naivebayes package).  Does that 
make sense?
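
To make the ask concrete, here's roughly the kind of split I mean, working 
straight off a SequenceFile.  The Text/VectorWritable key/value types and the 
80/20 ratio are just placeholders; it could just as easily split the raw 
<Text, Text> seqdirectory output before vectorizing:

import java.io.IOException;
import java.util.Random;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.VectorWritable;

public class HoldoutSplit {

  /** Copy each record of the input into either the training or the test output at random. */
  public static void split(FileSystem fs, Configuration conf, Path input,
                           Path trainOut, Path testOut, double testFraction, Random rng)
      throws IOException {
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, input, conf);
    SequenceFile.Writer train =
        SequenceFile.createWriter(fs, conf, trainOut, Text.class, VectorWritable.class);
    SequenceFile.Writer test =
        SequenceFile.createWriter(fs, conf, testOut, Text.class, VectorWritable.class);
    try {
      Text key = new Text();
      VectorWritable value = new VectorWritable();
      while (reader.next(key, value)) {
        // Hold out roughly testFraction of the records for testing.
        (rng.nextDouble() < testFraction ? test : train).append(key, value);
      }
    } finally {
      reader.close();
      train.close();
      test.close();
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    split(fs, conf, new Path(args[0]), new Path(args[1]), new Path(args[2]),
          0.2, new Random(42));
  }
}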

Here's what I'm ultimately trying to do:
I've got all this ASF email data.  It's currently bucketed like the newsgroups 
stuff, so I thought I would build a similar example (but one that actually 
makes sense to run on a cluster due to its size).  I want to split the data 
into test and training sets across all the mailing lists, such that one could 
attempt to classify new mail as to which project it belongs to (it will be 
interesting to see how dev lists compare to user lists).  WIP is at 
github.com/lucidimagination/mahout.
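
Concretely, since the data is already bucketed per list, I'm picturing the 
split as just looping over the per-list SequenceFiles and holding out a slice 
of each, so every list shows up in both the training and the test set.  The 
one-SequenceFile-per-list layout is hypothetical, and this reuses the split 
sketch above:

import java.util.Random;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SplitAllLists {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path root = new Path(args[0]);      // assumed: one SequenceFile per mailing list
    Path trainRoot = new Path(args[1]);
    Path testRoot = new Path(args[2]);
    Random rng = new Random(42);
    for (FileStatus stat : fs.listStatus(root)) {
      if (stat.isDir()) {
        continue;                       // skip anything that isn't a per-list SequenceFile
      }
      String name = stat.getPath().getName();
      // Stratified hold-out: split each list's data separately.
      HoldoutSplit.split(fs, conf, stat.getPath(),
          new Path(trainRoot, name), new Path(testRoot, name), 0.2, rng);
    }
  }
}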

Given time, I'd also like to hook in some of the other classifiers, as I think 
it would be useful to have a single example, with real data, that runs all the 
various algorithms (clustering, classification, CF, etc.).

-Grant
