On Jan 7, 2007, at 9:23 PM, Ken Williams wrote:
I would happily ignore all this and use NB, but it has one major
flaw: "the winner takes it all." The first result returned is way too
far (as in distance :)) from the others, which isn't exactly accurate
if one cares about a balanced results pool. I don't know whether this
is an implementation problem (I poked around the rescale() function in
Util.pm with no real success) or a general algorithm problem. My goal
is to have an implementation that can say: this text is 60% cat X,
20% cat Y, 18% cat Z and 2% other cats. Is this feasible? If so, what
approach would you recommend (which algorithm, which implementation,
or what path for implementing it)?
Unfortunately, neither NB nor SVMs can really tell you that. SVMs
are purely discriminative, so all they can tell you is "I think
this new example is more like class A than class B in my training
data". There's no probability involved at all. That said, I
believe there has been some research into how to translate SVM
output scores into probabilities or confidence scores, but I'm not
really familiar with it.
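For what it's worth, the usual technique for this is Platt scaling: fit a sigmoid to the SVM's decision values on held-out data, then read probabilities off the sigmoid. A minimal sketch in Python (plain gradient descent on the log-likelihood; the function names are mine, not from any real module):

```python
import math

def platt_fit(scores, labels, iters=2000, lr=0.01):
    """Fit sigmoid parameters (A, B) so that P(y=1 | score s)
    is approximately 1 / (1 + exp(A*s + B))."""
    A, B = 0.0, 0.0
    n = len(scores)
    for _ in range(iters):
        gA = gB = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(A * s + B))
            # gradient of the negative log-likelihood w.r.t. A and B
            gA += (y - p) * s
            gB += (y - p)
        A -= lr * gA / n
        B -= lr * gB / n
    return A, B

def platt_prob(score, A, B):
    """Map a raw SVM decision value to a calibrated probability."""
    return 1.0 / (1.0 + math.exp(A * score + B))
```

In practice you'd fit A and B on a validation set rather than the training set, since the training-set decision values are biased.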
NB on the surface would seem to be a better option since it's
directly based on probabilities, but again the algorithm was
designed only to discriminate, so all those denominators that are
thrown away (the "P(words)" terms in the A::NB documentation) mean
that the notion of probabilities is lost. The rescale() function
is basically just a hack to return scores that are a little more
convenient to work with than the raw output of the algorithm. As
you've seen, it tends to be a little arrogant, greatly exaggerating
the score for the first category and giving tiny scores to the
rest. I'm sure there are better algorithms that could be used
there, but in many cases either one doesn't really care about the
actual scores, or one (*ahem*) does something ad hoc like taking
the square root of all the scores, or the fifth root, or whatever,
just to get some numbers that look better to end users.
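The root trick Ken describes can be sketched in a few lines (Python here for illustration only; it assumes raw scores are positive, which holds for rescale()-style output). It does not produce calibrated probabilities; it only compresses the spread so the numbers look less extreme:

```python
def soften_scores(scores, root=5):
    """Take the nth root of each category score and renormalize
    so the softened scores sum to 1. Purely cosmetic smoothing."""
    softened = {cat: s ** (1.0 / root) for cat, s in scores.items()}
    total = sum(softened.values())
    return {cat: v / total for cat, v in softened.items()}

# A "winner takes all" result becomes less extreme after softening,
# though the ranking of categories is preserved.
raw = {"X": 0.96, "Y": 0.03, "Z": 0.01}
softened = soften_scores(raw)
```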
Just to add a note here: Ken is correct -- both NB and SVMs are known
to be rather poor at providing accurate probabilities. Their scores
tend to be too extreme. Producing good probabilities from these
scores is called calibrating the classifier, and it's more complex
than just taking a root of the score. There are several methods for
calibrating scores. The good news is that there's an effective one
called isotonic regression (or Pool Adjacent Violators) which is
pretty easy and fast. The bad news is that there's no plug-in (i.e.,
CPAN-ready) Perl implementation of it (I've got a simple
implementation which I should convert and contribute someday).
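For the curious, Pool Adjacent Violators is short enough to show in full. A minimal sketch (Python rather than Perl, and unit weights only, for simplicity):

```python
def pav(labels):
    """Pool Adjacent Violators for isotonic regression.

    labels: 0/1 outcomes sorted by ascending classifier score.
    Returns one calibrated probability per input, non-decreasing:
    each value is the mean of the pooled block it belongs to."""
    # each block is [mean value, weight (number of pooled examples)]
    merged = []
    for y in labels:
        merged.append([float(y), 1.0])
        # merge adjacent blocks while the monotonicity is violated
        while len(merged) > 1 and merged[-2][0] > merged[-1][0]:
            v2, w2 = merged.pop()
            v1, w1 = merged.pop()
            w = w1 + w2
            merged.append([(v1 * w1 + v2 * w2) / w, w])
    out = []
    for v, w in merged:
        out.extend([v] * int(w))
    return out
```

To calibrate a classifier you'd sort a held-out set by raw score, run PAV on the 0/1 labels, and use the resulting step function to map new scores to probabilities.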
If you want to read about classifier calibration, google one of these
titles:
"Transforming classifier scores into accurate multiclass probability
estimates"
by Bianca Zadrozny and Charles Elkan
"Predicting Good Probabilities With Supervised Learning"
by A. Niculescu-Mizil and R. Caruana
Regards,
-Tom