Sounds like you're on the right path. Looking at the misclassified
documents and the feature coefficients is a common way to debug a
classifier, especially if you use boolean features.
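
As a rough sketch of that kind of inspection, assuming a binary
CountVectorizer and an L1-penalized LogisticRegression (the toy corpus,
the variable names, the value of C and the explain helper are all made up
for illustration):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus, invented for illustration only.
docs = [
    "good great film", "great acting good plot", "good fun great cast",
    "bad boring film", "boring plot bad acting", "bad dull boring cast",
]
labels = np.array([1, 1, 1, 0, 0, 0])  # 1 = "positive", 0 = "negative"

vectorizer = CountVectorizer(binary=True)  # boolean unigram features
X = vectorizer.fit_transform(docs)

clf = LogisticRegression(penalty="l1", solver="liblinear", C=10.0,
                         random_state=0)
clf.fit(X, labels)

# Feature names ordered by column index, built from vocabulary_ so the
# snippet does not depend on a version-specific accessor.
names = np.array(sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get))

# Binary problem: coef_ has shape (1, n_features). A positive coefficient
# pushes towards clf.classes_[1], a negative one towards clf.classes_[0],
# and a coefficient of 0 (odds ratio exp(0) = 1) has no influence.
coefs = clf.coef_.ravel()

order = np.argsort(coefs)
print("towards class 0:", [(str(names[i]), float(coefs[i])) for i in order[:3]])
print("towards class 1:", [(str(names[i]), float(coefs[i])) for i in order[::-1][:3]])


def explain(doc):
    """Return (feature, coefficient) pairs for the features active in doc."""
    active = vectorizer.transform([doc]).nonzero()[1]
    return sorted(((str(names[j]), float(coefs[j])) for j in active),
                  key=lambda pair: pair[1])


# List the documents the model gets wrong (training data here, for brevity;
# in practice you would look at held-out predictions).
pred = clf.predict(X)
for i in np.flatnonzero(pred != labels):
    print(docs[i], explain(docs[i]))
```

Calling explain on a single misclassified document shows which of its
features pulled the decision in which direction.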

If you're using a sklearn vectorizer this might be of interest to you:
http://stackoverflow.com/questions/11116697/how-to-get-most-informative-features-for-scikit-learn-classifiers

Remember to do cross-validation, and if your data set is big enough it is
usually a good idea to keep a separate test set that you don't use when
optimizing your features.
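
A minimal sketch of that setup, assuming a recent scikit-learn where these
helpers live in sklearn.model_selection (older releases had them in
sklearn.cross_validation); the synthetic data, the 80/20 split and the
5 folds are arbitrary choices for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data standing in for the real document-term matrix.
X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           random_state=0)

# Put the test set aside first; it is not touched while tuning features.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)

# Cross-validate on the development part while iterating on features.
cv_f1 = cross_val_score(clf, X_dev, y_dev, cv=5, scoring="f1")
print("cross-validated F1: %.3f +/- %.3f" % (cv_f1.mean(), cv_f1.std()))

# Only once the features are settled: fit on all development data and
# score a single time on the untouched test set.
clf.fit(X_dev, y_dev)
test_f1 = f1_score(y_test, clf.predict(X_test))
print("held-out test F1: %.3f" % test_f1)
```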


/Tobias

On Wed, Feb 19, 2014 at 8:57 PM, Pavel Soriano <sorianopa...@gmail.com> wrote:

> Hello scikit!
>
> I need some insights into what I am doing.
>
> Currently I am building a text classifier (2 classes) using unigrams (word
> level) and some writing style features. I am using a Logistic Regression
> model with L1 regularization. I get decent performance (around .70
> f-measure) on the given corpus.
>
> I would like to do an error analysis, that is, to study the incorrectly
> classified documents and get some information from them, in order to maybe
> develop some rules to treat these cases or improve/modify my features.
>
> I thought about using the values of the coefficients of the fitted logit
> equation to get a glimpse of which words in the vocabulary, or which style
> features, most affect the classification decision. Is it correct to
> assume that if the coefficient of a variable is positive, the variable
> pushes the decision towards the "positive" label? If it is close to
> one, is it almost 50/50 for the final classification, and if it is
> negative, does it contribute towards the "negative" class?
>
> I have read about logit regression interpretation
> (Ref 1 <http://www.ats.ucla.edu/stat/mult_pkg/faq/general/odds_ratio.htm>,
> Ref 2 <http://www.appstate.edu/~whiteheadjc/service/logit/intro.htm#interp>),
> and so it seems this is a correct way to interpret the coefficients, but I
> would like to be sure.
>
> If you have any other ideas on how to approach the error analysis,
> please share them with me.
> Thanks for the input!
>
> Pavel SORIANO
>
>
> ------------------------------------------------------------------------------
> Managing the Performance of Cloud-Based Applications
> Take advantage of what the Cloud has to offer - Avoid Common Pitfalls.
> Read the Whitepaper.
>
> http://pubads.g.doubleclick.net/gampad/clk?id=121054471&iu=/4140/ostg.clktrk
> _______________________________________________
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
>