Stuart,
Yes, the data is quite imbalanced (this is what I meant by p(success) < .05).

To be clear, I calculate the predicted number of successes as

    \sum_i \hat{y}_i,  where \hat{y}_i = logregN.predict_proba(design)[i, 1] * success_fail.sum(axis=1)[i]

and compare that number to the observed number of successes. I find the predicted
number is always higher (I think because of the intercept).
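
In code, that check looks like this (a minimal sketch, reusing logregN, design,
and success_fail from the fitting code quoted below):

    # Predicted number of successes: per-setting success probability times
    # the number of trials (successes + failures) for that setting.
    n_trials = success_fail.sum(axis=1)
    p_hat = logregN.predict_proba(design)[:, 1]
    predicted_successes = (p_hat * n_trials).sum()

    observed_successes = success_fail['success'].sum()
    # predicted_successes consistently comes out higher than observed_successes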

I was not aware of a bias for imbalanced data. Can you tell me more? Why does
it not appear when the regularization is relaxed? Also, with statsmodels'
logistic regression, which has no regularization, the same data doesn't show
this problem. Any suggestions for how I could fix this are welcome.
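
For reference, a sketch of the statsmodels fit I mean (assuming its binomial
GLM, which takes the (success, fail) count matrix as the response directly, so
no row stacking or sample weights are needed):

    import statsmodels.api as sm

    # Unregularized binomial GLM on the same design matrix; the response is
    # the two-column (successes, failures) count matrix.
    glm = sm.GLM(success_fail[['success', 'fail']], design,
                 family=sm.families.Binomial())
    res = glm.fit()

    # Predicted success count from the unregularized fit; this one tracks
    # the observed count much more closely.
    predicted = (res.predict(design) * success_fail.sum(axis=1)).sum()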

Thank you

On Dec 15, 2016, at 4:41 PM, Stuart Reynolds <stu...@stuartreynolds.net> wrote:

LR is biased with imbalanced datasets. Is your dataset imbalanced (e.g., does
one class have a much smaller prevalence in the data than the other)?

On Thu, Dec 15, 2016 at 1:02 PM, Rachel Melamed <mela...@uchicago.edu> wrote:
I just tried it and it did not appear to change the results at all.
I ran it as follows:
1) Normalize the dummy variables (by subtracting the median) to make a matrix
of about 10000 x 5.

2) For each of the 1000 output variables:
a. Each output variable uses the same dummy variables, but not all settings of
the covariates are observed for every output variable. So, per output variable,
I create the design matrix with patsy, including pairwise interactions. That
leaves me with a design matrix of around 10000 x 350, and a matrix I call
"success_fail" that holds, for each setting, the number of successes and the
number of failures, so it is of size 10000 x 2. (A concrete sketch of these
steps follows the fitting code below.)

b. Run the regression using:

    import numpy as np
    from sklearn import linear_model

    # Stack the design matrix twice: the first copy gets label 1
    # ("success"), the second label 0 ("fail"), and the observed counts
    # enter as sample weights.
    skdesign = np.vstack((design, design))
    sklabel = np.hstack((np.ones(success_fail.shape[0]),
                         np.zeros(success_fail.shape[0])))
    skweight = np.hstack((success_fail['success'], success_fail['fail']))

    logregN = linear_model.LogisticRegression(C=1,
                                              solver='lbfgs',
                                              fit_intercept=False)
    logregN.fit(skdesign, sklabel, sample_weight=skweight)
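
For concreteness, here is a minimal sketch of steps 1 and 2a; the covariate
names (a through e) and the DataFrame covariates_df are hypothetical stand-ins:

    import patsy

    # Step 1: median-center the dummy-coded covariates (roughly 10000 x 5).
    centered = covariates_df - covariates_df.median()

    # Step 2a: build the design matrix with all pairwise interactions.
    # "(a + b + c + d + e)**2" expands to the main effects plus every
    # two-way interaction, giving roughly 10000 x 350 columns here.
    design = patsy.dmatrix("(a + b + c + d + e)**2", centered,
                           return_type='dataframe')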


On Dec 15, 2016, at 2:16 PM, Alexey Dral <aad...@gmail.com> wrote:

Could you try normalizing the dataset after dummy-encoding the features and
see if the behavior is reproducible?

2016-12-15 22:03 GMT+03:00 Rachel Melamed <mela...@uchicago.edu>:
Thanks for the reply. The covariates ("X") are all dummy/categorical
variables. So I guess no, nothing is normalized.

On Dec 15, 2016, at 1:54 PM, Alexey Dral <aad...@gmail.com> wrote:

Hi Rachel,

Do you have your data normalized?

2016-12-15 20:21 GMT+03:00 Rachel Melamed <mela...@uchicago.edu>:
Hi all,
Does anyone have any suggestions for this problem:
http://stackoverflow.com/questions/41125342/sklearn-logistic-regression-gives-biased-results


I am running around 1000 similar logistic regressions, with the same covariates
but slightly different data and response variables. All of my response
variables have sparse successes (usually p(success) < .05).

I noticed that with the regularized regression, the results are consistently
biased to predict more "successes" than are observed in the training data. When
I relax the regularization, this bias goes away. The observed bias is
unacceptable for my use case, although the more-regularized model does seem a
bit better.
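
A minimal sketch of that comparison (reusing skdesign, sklabel, and skweight
from the fitting code quoted above; the very large C is just a stand-in for
relaxed regularization):

    from sklearn import linear_model

    for C in (1.0, 1e8):  # strong vs. effectively relaxed regularization
        clf = linear_model.LogisticRegression(C=C, solver='lbfgs',
                                              fit_intercept=False)
        clf.fit(skdesign, sklabel, sample_weight=skweight)
        predicted = (clf.predict_proba(design)[:, 1]
                     * success_fail.sum(axis=1)).sum()
        print(C, predicted, success_fail['success'].sum())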

Below, I plot the results for the 1000 different regressions for two different
values of C:
[Plot: results of the 1000 regressions for two values of C]
<https://i.stack.imgur.com/1cbrC.png>

I looked at the parameter estimates for one of these regressions (below, each
point is one parameter). It seems like the intercept (the point on the bottom
left) is too high for the C=1 model.
[Plot: parameter estimates for one regression]
<https://i.stack.imgur.com/NTFOY.png>






--
Yours sincerely,
Alexey A. Dral




--
Yours sincerely,
Alexey A. Dral

_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
