On Jan 5, 2007, at 7:10 AM, zgrim wrote:

> So, back to my dilemmas. :) The results are puzzling, as many of the
> research papers on the subject I've consulted say that SVM is
> supposedly the best algorithm for this task. The radial kernel should
> give the best results, for empirically found values of gamma and C.

This may be an issue with your corpus - I quite often find that when I don't have enough training data for the SVM to pick up on the "truth" patterns, or (somewhat equivalently) when there's a lot of noise in the data, a linear kernel will outperform a radial (RBF) one. I tend to think that's because the RBF kernel is more expressive, so it ends up overfitting the noise in the training set.
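A quick way to check, if you happen to be using Algorithm::SVM (the libsvm binding) - treat this as a sketch, since I'm writing the parameter names from memory, so double-check against its docs:

    use Algorithm::SVM;
    use Algorithm::SVM::DataSet;

    my @train;   # one DataSet per document, built from your features
    # push @train, Algorithm::SVM::DataSet->new(Label => $cat_id,
    #                                           Data  => \@feature_vector);

    for my $kernel ('linear', 'radial') {
        my $svm = Algorithm::SVM->new(Type => 'C-SVC', Kernel => $kernel);
        $svm->train(@train);
        # validate(5) runs 5-fold cross-validation on the training set
        # and returns an accuracy figure
        printf "%-7s kernel: %s\n", $kernel, $svm->validate(5);
    }

If the linear kernel wins the cross-validation, that's a decent hint that more (or cleaner) training data is what the RBF needs.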


> Ignoring the fact that SVM is much, much slower to train than NB, it
> still has worse accuracy. What am I doing wrong?

That may be an accident of your corpus too. Are you using cross-validation for these experiments? If so, you should be able to get some error bars to tell whether the difference is statistically significant or not. I'm guessing a 2% advantage may not be, in this case.
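To make that concrete, something like this gives you a crude error bar (plain Perl; the fold accuracies are made up):

    use List::Util qw(sum);

    my @acc  = (0.81, 0.84, 0.79, 0.83, 0.80);  # per-fold accuracies
    my $n    = @acc;
    my $mean = sum(@acc) / $n;
    my $var  = sum(map { ($_ - $mean) ** 2 } @acc) / ($n - 1);
    my $se   = sqrt($var / $n);
    printf "accuracy = %.3f +/- %.3f (mean +/- std. error, %d folds)\n",
           $mean, $se, $n;

If the NB and SVM intervals overlap heavily, the 2% gap is probably noise.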

> I would happily ignore all this and use NB, but it has one major flaw.
> "The winner takes it all": the first result returned is way too far
> (as in distance :)) from the others, which isn't exactly accurate if
> one cares about a balanced results pool. I don't know whether this is
> an implementation problem - I poked around the rescale() function in
> Util.pm with no real success - or a general algorithm problem. My goal
> is to have an implementation that can say: this text is 60% cat X,
> 20% cat Y, 18% cat Z and 2% other cats. Is this feasible? If so, what
> approach would you recommend (which algorithm, which implementation,
> or what path for implementing it)?

Unfortunately, neither NB nor SVMs can really tell you that. SVMs are purely discriminative, so all they can tell you is "I think this new example is more like class A than class B in my training data". There's no probability involved at all. That said, I believe there has been some research into how to translate SVM output scores into probabilities or confidence scores, but I'm not really familiar with it.
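From what I remember, the standard reference there is Platt's method: fit a sigmoid to the SVM's decision values on held-out data, then read probabilities off that sigmoid. Very roughly (A and B below are placeholders, not anything actually fitted):

    # Platt-style calibration, sketched: map a raw decision value $f to
    # a probability through a sigmoid whose parameters were fit (by
    # maximum likelihood) on held-out decision values.
    my ($A, $B) = (-1.7, 0.0);   # hypothetical fitted parameters

    sub svm_prob {
        my ($f) = @_;
        return 1 / (1 + exp($A * $f + $B));
    }

    printf "P(class | f = 0.9) = %.3f\n", svm_prob(0.9);

But as I said, I haven't used it myself, so take that as a pointer rather than a recipe.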

NB on the surface would seem to be a better option since it's directly based on probabilities, but again the algorithm was designed only to discriminate, so all those denominators that are thrown away (the "P(words)" terms in the A::NB documentation) mean that the notion of probabilities is lost.

The rescale() function is basically just a hack to return scores that are a little more convenient to work with than the raw output of the algorithm. As you've seen, it tends to be a little arrogant, greatly exaggerating the score for the first category and giving tiny scores to the rest. I'm sure there are better algorithms that could be used there, but in many cases either one doesn't really care about the actual scores, or one (*ahem*) does something ad hoc like taking the square root of all the scores, or the fifth root, or whatever, just to get some numbers that look better to end users.
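If you do want that ad hoc massaging, it's only a few lines on top of what predict() returns (a hashref of category => score in A::NB) - a sketch:

    use List::Util qw(sum);

    # Soften the scores with an nth root and renormalize to sum to 1.
    # Purely cosmetic - the results still aren't real probabilities.
    sub soften {
        my ($scores, $root) = @_;   # $scores is { category => score }
        $root ||= 5;                # fifth root by default, per above
        my %soft = map { $_ => $scores->{$_} ** (1 / $root) }
                   keys %$scores;
        my $total = sum(values %soft) || 1;
        $_ /= $total for values %soft;
        return \%soft;
    }

    # my $display = soften($nb->predict(attributes => \%words));

That will get you numbers shaped like the 60/20/18/2 breakdown you described - just don't mistake them for calibrated probabilities.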

As for a better alternative, I'm not familiar with any that will be as accessible from a perl world, but you might want to look at some language modeling papers - I really like the LDA papers from Michael Jordan (no, not that Michael Jordan, this one: http://citeseer.ist.psu.edu/541352.html), which are by no means straightforward, but they will indeed let you describe each document as generated by a mixture of categories.

 -Ken
