Hello,

I am "playing" with the task of automated text categorization and have inevitably hit a few dilemmas. I have tried different combinations of SVM and NaiveBayes; here are some results:

- algorithm::svm (single world, through AI::Categorizer) ~ 92% accuracy (with the linear kernel; the radial one stays below 10% for every value of gamma and C I tried)
- algorithm::svmlight (one world per category, each trained against the others) ~ 62% in ranking mode
- algorithm::naivebayes (single world, through AI::Categorizer) ~ 94%
- algorithm::naivebayes (each against all others) ~ 73%

These results are all on the same corpus (which isn't perfect at all, but that is negligible for now :) ). By accuracy I mean tested accuracy on a single category: if the first category returned (highest score) is the expected one, it's a hit; otherwise, a miss. By single world I mean that all categories build a single model, against which the tests are run. By multiple worlds (each against all others) I mean that each category builds its own model, in which the tokens from that category are positive and the tokens from all other categories are negative.

So, back to my dilemmas. :)

The results are puzzling, as many of the research papers on the subject that I've consulted say that SVM is supposedly the best algorithm for this task. The radial kernel should give the best results, for empirically found values of gamma and C. Even ignoring the fact that SVM is much, much slower to train than NB, it still has worse accuracy. What am I doing wrong?

I would happily ignore all this and use NB, but it has one major flaw: "the winner takes it all". The first result returned is way too far (as in distance :)) from the others, which isn't exactly accurate if one cares about a balanced results pool. I don't know whether this is an implementation problem - I poked around the rescale() function in Util.pm with no real success - or a general algorithm problem.

My goal is to have an implementation that can say: this text is 60% cat X, 20% cat Y, 18% cat Z and 2% other cats. Is this feasible? If so, what approach would you recommend (which algorithm, which implementation, or what path for implementing it)?

TIA

--
perl -MLWP::Simple -e'print$_[rand(split(q|%%\n|, get(q=http://cpan.org/misc/japh=)))]'
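
To make the "60% cat X, 20% cat Y, 18% cat Z" goal concrete, here is a minimal sketch in Python of one way per-category scores could be turned into shares. It assumes the classifier exposes a log-score per category, and it softens them with a temperature before normalizing; dividing by the number of tokens is only one guess at a useful temperature. The function name, the temperature parameter and the sample scores are all hypothetical, and this is not what rescale() in Util.pm does.

```python
import math

def scores_to_percentages(log_scores, temperature=1.0):
    """Turn per-category log-scores into shares that sum to 100.

    temperature > 1 flattens the distribution; temperature == 1 is a
    plain normalization of exp(log-score). Both are assumptions made
    for illustration only.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = {cat: s / temperature for cat, s in log_scores.items()}
    # Subtract the maximum before exponentiating to avoid underflow.
    top = max(scaled.values())
    weights = {cat: math.exp(s - top) for cat, s in scaled.items()}
    total = sum(weights.values())
    return {cat: 100.0 * w / total for cat, w in weights.items()}

# Hypothetical log-scores for a 200-token document (made-up numbers).
log_scores = {"X": -1000.0, "Y": -1220.0, "Z": -1240.0}

# Plain normalization: the top category takes practically everything.
raw = scores_to_percentages(log_scores)

# Temperature equal to the token count: a much more balanced pool.
soft = scores_to_percentages(log_scores, temperature=200)

for cat in sorted(soft, key=soft.get, reverse=True):
    print(f"{cat}: raw {raw[cat]:.1f}%  softened {soft[cat]:.1f}%")
```

With sums of log-probabilities over a few hundred tokens, small per-token differences add up to huge gaps between categories, which is one plausible source of the "winner takes it all" effect. Whether a temperature like this gives shares that are actually meaningful would have to be checked against held-out documents.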