Answers inline

On Sun, Sep 11, 2011 at 7:49 PM, Grant Ingersoll <[email protected]> wrote:

> A few classifier questions:
>
> What's the difference between the two naive bayes packages?  AFAICT, the
> naivebayes works off of vectors already, but are there any differences in
> the algorithms themselves?

No change in the algorithm. The naivebayes package doesn't implement the
TF-IDF portion of the old NB job. It just works off any given set of vectors.

> In other words, if I do seq2sparse to get vectors in, all should be good to
> go w/ the new vector based naive bayes, right?

Yes, generate vectors using either seq2sparse or a Randomizer. Feed that
into the naivebayes model builder. It builds a vector for each class and
outputs them as sequence files.
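
To illustrate the "one vector per class" idea, here is a rough sketch in
plain Python, not Mahout code. It assumes the per-class vector is simply the
element-wise sum of the training vectors carrying that label, and it
represents a sparse vector as a dict of {feature index: weight}. The function
name build_class_vectors is made up for the example.

```python
from collections import defaultdict


def build_class_vectors(records):
    """Sum sparse vectors per label.

    records: iterable of (label, vector) pairs, where vector is a dict
    mapping feature index -> weight. Summation is an assumption made for
    this sketch, not a statement about Mahout's internals.
    """
    sums = defaultdict(lambda: defaultdict(float))
    for label, vector in records:
        for index, weight in vector.items():
            sums[label][index] += weight
    # Convert the nested defaultdicts to plain dicts for the caller.
    return {label: dict(vec) for label, vec in sums.items()}
```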

> Do we have docs on the new naivebayes package anywhere?

Unfortunately I haven't made a wiki entry yet. I will put one up soon. The
naivebayes package only has the model building code. Once the model is
built, the class vectors need to be loaded and the scores need to be
computed for each new document (dot product). This part is not yet there.
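
Until that part exists, the scoring step could look roughly like the sketch
below. It is plain Python, not Mahout code, and it assumes the class vectors
have already been loaded into memory as dicts of {feature index: weight}
whose weights are ready to be used as scores. The names dot and classify are
made up for the example.

```python
def dot(a, b):
    """Dot product of two sparse vectors stored as {index: weight} dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector
    return sum(weight * b.get(index, 0.0) for index, weight in a.items())


def classify(doc_vector, class_vectors):
    """Score a document against every class vector and pick the best label.

    class_vectors: dict mapping label -> sparse vector (hypothetical layout).
    Returns (best_label, scores) where scores maps label -> dot product.
    """
    scores = {label: dot(doc_vector, vec) for label, vec in class_vectors.items()}
    best_label = max(scores, key=scores.get)
    return best_label, scores
```

Each new document is vectorized the same way as the training data, then
passed to classify together with the loaded class vectors.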

>   For instance, how do the labels get associated with the training
> examples?  I see the --labels option, but it isn't clear how it relates to
> the training data.
>
The list of labels needs to be provided for the model builder to generate
indices out of it. The input data itself has to be in the form
Record<String Label, VectorWritable vector>
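
As a rough picture of how the label list and the training records fit
together, here is a sketch in plain Python, not Mahout code. It assumes the
index of a label is just its position in the supplied list, which may not
match what the model builder does. The function names and the label strings
in the usage note are made up for the example.

```python
def build_label_index(labels):
    """Map each distinct label to an integer index, in order of first appearance."""
    index = {}
    for label in labels:
        index.setdefault(label, len(index))
    return index


def index_records(records, label_index):
    """Replace the string label of each (label, vector) record with its index.

    Raises ValueError for a label that was not in the supplied list, since
    such a record could not be assigned to any class.
    """
    for label, vector in records:
        if label not in label_index:
            raise ValueError("unknown label: %r" % (label,))
        yield label_index[label], vector
```

For example, with the labels ["lucene-dev", "mahout-user"], a record
("mahout-user", vector) becomes (1, vector).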


>
> As for SplitBayesInput, I don't see that being used anywhere, but I think I
> have a case for it.  The only thing is, I want it to work off of
> SequenceFiles and split them, I think (b/c I want to run the new naivebayes
> package)  Does that make sense?

I haven't used this tool myself, but it looks like it's mimicking the old bayes
text format. It might need a bit of a rewrite.

>


> Here's what I'm ultimately trying to do:
> I've got all this ASF email data.  It's currently bucketed like the news
> groups stuff, so I thought I would build a similar example (but one that
> actually makes sense to run in a cluster due to size).  I want to take and
> split the data into test and training sets across all the mailing lists such
> that one could attempt to classify new mail as to which project it belongs
> to (it will be curious to see how it compares dev lists vs. user lists.)
>  WIP is at github.com/lucidimagination/mahout.
>
Yes. I would suggest vectorizing the whole dataset first. Then write the
splitter for that format.
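
The splitting logic itself is small once the data is in (label, vector)
form. Here is a sketch in plain Python, not Mahout code, that ignores the
SequenceFile reading and writing entirely. The function name split_records
and the parameters test_fraction and seed are made up for the example.

```python
import random


def split_records(records, test_fraction=0.2, seed=42):
    """Randomly assign each (label, vector) record to train or test.

    Each record goes to the test set with probability test_fraction, so the
    split sizes are approximate. A fixed seed makes the split repeatable.
    """
    rng = random.Random(seed)
    train, test = [], []
    for record in records:
        if rng.random() < test_fraction:
            test.append(record)
        else:
            train.append(record)
    return train, test
```

Because every record is decided independently, the same loop works as a
single streaming pass over a large input.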

>
> Given time, I'd also like to hook in some of the various other classifiers,
> as I think it would be useful to be able to have a single example, with real
> data, that runs all the various algorithms (clustering, classification, CF,
> etc.)
>
> -Grant
>
>
