Stuart,
Yes the data is quite imbalanced (this is what I meant by p(success) < .05 )
To be clear, I calculate
\sum_i \hat{y_i} = logregN.predict_proba(design)[:,1]*(success_fail.sum(axis=1))
and compare that number to the observed number of success. I find the predicted
number to always be higher (I think, because of the intercept).
I was not aware of a bias for imbalanced data. Can you tell me more? Why does
it not appear with the relaxed regularization? Also, using the same data with
statsmodels LR, which has no regularization, this doesn't seem to be a problem.
Any suggestions for how I could fix this are welcome.
Thank you
On Dec 15, 2016, at 4:41 PM, Stuart Reynolds
<[email protected]<mailto:[email protected]>> wrote:
LR is biased with imbalanced datasets. Is your dataset unbalanced? (e.g. is
there one class that has a much smaller prevalence in the data that the other)?
On Thu, Dec 15, 2016 at 1:02 PM, Rachel Melamed
<[email protected]<mailto:[email protected]>> wrote:
I just tried it and it did not appear to change the results at all?
I ran it as follows:
1) Normalize dummy variables (by subtracting median) to make a matrix of about
10000 x 5
2) For each of the 1000 output variables:
a. Each output variable uses the same dummy variables, but not all settings of
covariates are observed for all output variables. So I create the design matrix
using patsy per output variable to include pairwise interactions. Then, I have
an around 10000 x 350 design matrix , and a matrix I call “success_fail” that
has for each setting the number of success and number of fail, so it is of size
10000 x 2
b. Run regression using:
skdesign = np.vstack((design,design))
sklabel = np.hstack((np.ones(success_fail.shape[0]),
np.zeros(success_fail.shape[0])))
skweight = np.hstack((success_fail['success'], success_fail['fail']))
logregN = linear_model.LogisticRegression(C=1,
solver= 'lbfgs',fit_intercept=False)
logregN.fit(skdesign, sklabel, sample_weight=skweight)
On Dec 15, 2016, at 2:16 PM, Alexey Dral
<[email protected]<mailto:[email protected]>> wrote:
Could you try to normalize dataset after feature dummy encoding and see if it
is reproducible behavior?
2016-12-15 22:03 GMT+03:00 Rachel Melamed
<[email protected]<mailto:[email protected]>>:
Thanks for the reply. The covariates (“X") are all dummy/categorical
variables. So I guess no, nothing is normalized.
On Dec 15, 2016, at 1:54 PM, Alexey Dral
<[email protected]<mailto:[email protected]>> wrote:
Hi Rachel,
Do you have your data normalized?
2016-12-15 20:21 GMT+03:00 Rachel Melamed
<[email protected]<mailto:[email protected]>>:
Hi all,
Does anyone have any suggestions for this problem:
http://stackoverflow.com/questions/41125342/sklearn-logistic-regression-gives-biased-results
I am running around 1000 similar logistic regressions, with the same covariates
but slightly different data and response variables. All of my response
variables have a sparse successes (p(success) < .05 usually).
I noticed that with the regularized regression, the results are consistently
biased to predict more "successes" than is observed in the training data. When
I relax the regularization, this bias goes away. The bias observed is
unacceptable for my use case, but the more-regularized model does seem a bit
better.
Below, I plot the results for the 1000 different regressions for 2 different
values of C: [results for the different regressions for 2 different values of
C] <https://i.stack.imgur.com/1cbrC.png>
I looked at the parameter estimates for one of these regressions: below each
point is one parameter. It seems like the intercept (the point on the bottom
left) is too high for the C=1 model. [enter image description here]
<https://i.stack.imgur.com/NTFOY.png>
_______________________________________________
scikit-learn mailing list
[email protected]<mailto:[email protected]>
https://mail.python.org/mailman/listinfo/scikit-learn
--
Yours sincerely,
Alexey A. Dral
_______________________________________________
scikit-learn mailing list
[email protected]<mailto:[email protected]>
https://mail.python.org/mailman/listinfo/scikit-learn
_______________________________________________
scikit-learn mailing list
[email protected]<mailto:[email protected]>
https://mail.python.org/mailman/listinfo/scikit-learn
--
Yours sincerely,
Alexey A. Dral
_______________________________________________
scikit-learn mailing list
[email protected]<mailto:[email protected]>
https://mail.python.org/mailman/listinfo/scikit-learn
_______________________________________________
scikit-learn mailing list
[email protected]<mailto:[email protected]>
https://mail.python.org/mailman/listinfo/scikit-learn
_______________________________________________
scikit-learn mailing list
[email protected]<mailto:[email protected]>
https://mail.python.org/mailman/listinfo/scikit-learn
_______________________________________________
scikit-learn mailing list
[email protected]
https://mail.python.org/mailman/listinfo/scikit-learn