What about using a distance metric like this one?
http://en.wikipedia.org/wiki/Normalized_Google_distance

________________________________________
From: Joel Nothman [joel.noth...@gmail.com]
Sent: 19 February 2014 22:50
To: scikit-learn-general
Subject: Re: [Scikit-learn-general] Logistic regression coefficients analysis
It is correct to assume that a positive coefficient contributes positively to a decision. However, because the features are interdependent, the raw strength of a feature isn't always straightforward to interpret. For example, the model might give a big positive coefficient to "Tel" and a similar negative coefficient to "Aviv", but since these almost always appear together, their presence has little net effect.

The usefulness of the weights also depends on the scale of your features. If you use raw term frequency, a small positive coefficient may have a great effect for a word that, when it appears, appears many times throughout a document; if you use tf.idf, a feature with high DF can attract a high coefficient but contribute little to the overall decision (although L1 regularisation might help avoid this).

- Joel

On 20 February 2014 06:57, Pavel Soriano <sorianopa...@gmail.com> wrote:

Hello scikit!

I need some insights into what I am doing. I am currently building a two-class text classifier using word-level unigrams and some writing-style features, with a logistic regression model under L1 regularization. I get decent performance (around 0.70 f-measure) on the given corpus.

I would like to make an error analysis, that is, to study the incorrectly classified documents and extract some information from them, in order to develop rules to treat these cases or to improve/modify my features.

I thought about using the coefficients of the fitted logit equation to get a glimpse of which words in the vocabulary, or which style features, affect the classification decision the most. Is it correct to assume that a positive coefficient means the variable pushes towards the "positive" label, that an odds ratio near one (a coefficient near zero) means the variable contributes almost nothing, leaving the decision close to 50/50, and that a negative coefficient pushes towards the "negative" class? I have read about logit regression interpretation (Ref 1 <http://www.ats.ucla.edu/stat/mult_pkg/faq/general/odds_ratio.htm>, Ref 2 <http://www.appstate.edu/~whiteheadjc/service/logit/intro.htm#interp>), and it seems this is a correct way to interpret the coefficients, but I would like to be sure.

If you have any other ideas of how to perform a different error analysis, please share them with me.

Thanks for the input!

Pavel SORIANO
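As a rough illustration of the coefficient inspection discussed above, here is a minimal sketch. The toy corpus, the variable names, and the choice of TfidfVectorizer with liblinear's L1 penalty are placeholders, not anything taken from this thread:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Placeholder two-class corpus; 1 = "positive", 0 = "negative".
docs = ["good service and friendly staff",
        "terrible food and rude staff",
        "friendly service and good food",
        "rude service and terrible food"]
y = np.array([1, 0, 1, 0])

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# L1-regularised logistic regression, as in Pavel's setup.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
clf.fit(X, y)

feature_names = np.asarray(vectorizer.get_feature_names_out())
coefs = clf.coef_.ravel()  # one weight per feature in the binary case

# Positive weights push towards class 1, negative towards class 0.
order = np.argsort(coefs)
print("most negative:", list(zip(feature_names[order[:5]], coefs[order[:5]])))
print("most positive:", list(zip(feature_names[order[-5:]], coefs[order[-5:]])))

Per Joel's point, these weights are only directly comparable across features that live on the same scale.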
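In the same spirit, a sketch of the error analysis Pavel describes: pull out the misclassified documents and rank each document's active features by their contribution to the decision function (coefficient times feature value). It reuses the hypothetical vectorizer/clf/X/y names from the sketch above, and in practice you would run it on held-out documents rather than the training set:

pred = clf.predict(X)
wrong = np.flatnonzero(pred != y)
print(len(wrong), "misclassified documents")

for i in wrong:
    row = X[i].toarray().ravel()
    # Per-feature contribution to the decision function for this document.
    contrib = row * coefs
    top = np.argsort(np.abs(contrib))[::-1]
    active = [(feature_names[j], round(contrib[j], 3))
              for j in top[:5] if row[j] != 0]
    print("doc %d: true=%d pred=%d" % (i, y[i], pred[i]), active)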
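Finally, a hedged sketch of the Normalized Google Distance suggested at the top of the thread. The formula follows the linked Wikipedia definition; computing the frequencies from document counts in a local corpus, rather than from Google hit counts, is an assumption made here for the sake of a runnable example:

import math

def ngd(x, y, docs):
    # NGD(x, y) = (max(log f(x), log f(y)) - log f(x, y))
    #             / (log N - min(log f(x), log f(y)))
    n = len(docs)
    fx = sum(1 for d in docs if x in d)
    fy = sum(1 for d in docs if y in d)
    fxy = sum(1 for d in docs if x in d and y in d)
    if fx == 0 or fy == 0 or fxy == 0:
        return float("inf")  # terms never (co-)occur: maximally distant
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))

docs = [set(d.split()) for d in ["tel aviv israel", "tel aviv beach",
                                 "london beach", "london israel"]]
print(ngd("tel", "aviv", docs))    # 0.0: the terms always co-occur
print(ngd("tel", "london", docs))  # inf: the terms never co-occur

Terms that always co-occur, like the "Tel"/"Aviv" pair in Joel's example, come out at distance zero.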