On Jan 7, 2007, at 9:23 PM, Ken Williams wrote:
I would happily ignore all this and use NB, but it has one major
flaw: "the winner takes it all." The first result returned is way too
far (as in distance :)) from the others, which isn't exactly accurate
if one cares about a balanced results pool. I don't know whether this
is an implementation problem (I poked around the rescale() function in
Util.pm with no real success) or a general algorithm problem. My goal
is to have an implementation that can say: this text is 60% cat X,
20% cat Y, 18% cat Z and 2% other cats. Is this feasible? If so, what
approach would you recommend (which algorithm, which implementation,
or what path for implementing it)?
Unfortunately, neither NB nor SVMs can really tell you that. SVMs
are purely discriminative, so all they can tell you is "I think
this new example is more like class A than class B in my training
data". There's no probability involved at all. That said, I
believe there has been some research into how to translate SVM
output scores into probabilities or confidence scores, but I'm not
really familiar with it.
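For what it's worth, the usual technique for this is Platt scaling: fit a sigmoid to the SVM's decision values on held-out data, then read probabilities off the sigmoid. A minimal sketch in Python (plain gradient descent on the log-likelihood; the function names are mine, not from any real module):

```python
import math

def platt_fit(scores, labels, iters=2000, lr=0.01):
    """Fit sigmoid parameters (A, B) so that P(y=1 | score s)
    is approximately 1 / (1 + exp(A*s + B))."""
    A, B = 0.0, 0.0
    n = len(scores)
    for _ in range(iters):
        gA = gB = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(A * s + B))
            # gradient of the negative log-likelihood w.r.t. A and B
            gA += (y - p) * s
            gB += (y - p)
        A -= lr * gA / n
        B -= lr * gB / n
    return A, B

def platt_prob(score, A, B):
    """Map a raw SVM decision value to a calibrated probability."""
    return 1.0 / (1.0 + math.exp(A * score + B))
```

In practice you'd fit A and B on a validation set rather than the training set, since the training-set decision values are biased.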
NB on the surface would seem to be a better option since it's
directly based on probabilities, but again the algorithm was
designed only to discriminate, so all those denominators that are
thrown away (the "P(words)" terms in the A::NB documentation) mean
that the notion of probabilities is lost. The rescale() function
is basically just a hack to return scores that are a little more
convenient to work with than the raw output of the algorithm. As
you've seen, it tends to be a little arrogant, greatly exaggerating
the score for the first category and giving tiny scores to the
rest. I'm sure there are better algorithms that could be used
there, but in many cases either one doesn't really care about the
actual scores, or one (*ahem*) does something ad hoc like taking
the square root of all the scores, or the fifth root, or whatever,
just to get some numbers that look better to end users.
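The root trick Ken describes can be sketched in a few lines (Python here for illustration only; it assumes raw scores are positive, which holds for rescale()-style output). It does not produce calibrated probabilities; it only compresses the spread so the numbers look less extreme:

```python
def soften_scores(scores, root=5):
    """Take the nth root of each category score and renormalize
    so the softened scores sum to 1. Purely cosmetic smoothing."""
    softened = {cat: s ** (1.0 / root) for cat, s in scores.items()}
    total = sum(softened.values())
    return {cat: v / total for cat, v in softened.items()}

# A "winner takes all" result becomes less extreme after softening,
# though the ranking of categories is preserved.
raw = {"X": 0.96, "Y": 0.03, "Z": 0.01}
softened = soften_scores(raw)
```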
Just to add a note here: Ken is correct -- both NB and SVMs are known
to be rather poor at providing accurate probabilities. Their scores
tend to be too extreme. Producing good probabilities from these
scores is called calibrating the classifier, and it's more complex
than just taking a root of the score. There are several methods for
calibrating scores. The good news is that there's an effective one
called isotonic regression (or Pool Adjacent Violators) which is
pretty easy and fast. The bad news is that there's no plug-in (i.e.,
CPAN-ready) Perl implementation of it (I've got a simple
implementation which I should convert and contribute someday).
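For the curious, Pool Adjacent Violators is short enough to show in full. A minimal sketch (Python rather than Perl, and unit weights only, for simplicity):

```python
def pav(labels):
    """Pool Adjacent Violators for isotonic regression.

    labels: 0/1 outcomes sorted by ascending classifier score.
    Returns one calibrated probability per input, non-decreasing:
    each value is the mean of the pooled block it belongs to."""
    # each block is [mean value, weight (number of pooled examples)]
    merged = []
    for y in labels:
        merged.append([float(y), 1.0])
        # merge adjacent blocks while the monotonicity is violated
        while len(merged) > 1 and merged[-2][0] > merged[-1][0]:
            v2, w2 = merged.pop()
            v1, w1 = merged.pop()
            w = w1 + w2
            merged.append([(v1 * w1 + v2 * w2) / w, w])
    out = []
    for v, w in merged:
        out.extend([v] * int(w))
    return out
```

To calibrate a classifier you'd sort a held-out set by raw score, run PAV on the 0/1 labels, and use the resulting step function to map new scores to probabilities.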
If you want to read about classifier calibration, google one of these
titles:
"Transforming classifier scores into accurate multiclass probability
estimates"
by Bianca Zadrozny and Charles Elkan
"Predicting Good Probabilities With Supervised Learning"
by A. Niculescu-Mizil and R. Caruana
Regards,
-Tom