I just tried it and it did not appear to change the results at all?
I ran it as follows:
1) Normalize the dummy variables (by subtracting the median) to make a matrix of
about 10000 x 5.
2) For each of the 1000 output variables:
a. Each output variable uses the same dummy variables, but not all settings of
the covariates are observed for every output variable. So I create the design
matrix with patsy per output variable, including pairwise interactions (see the
sketch below, after step 2b). This gives a roughly 10000 x 350 design matrix,
plus a matrix I call “success_fail” that holds, for each setting, the number of
successes and the number of failures, so it is of size 10000 x 2.
b. Run the regression using:

import numpy as np
from sklearn import linear_model

# duplicate the design matrix: one copy labeled 1 ("success"), one labeled 0
# ("fail"), with the observed counts passed in as sample weights
skdesign = np.vstack((design, design))
sklabel = np.hstack((np.ones(success_fail.shape[0]),
                     np.zeros(success_fail.shape[0])))
skweight = np.hstack((success_fail['success'], success_fail['fail']))

logregN = linear_model.LogisticRegression(C=1, solver='lbfgs',
                                          fit_intercept=False)
logregN.fit(skdesign, sklabel, sample_weight=skweight)
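
In code, steps 1 and 2a look roughly like this (the column names, the formula,
and the success/fail counts below are synthetic placeholders, not my real data):

import numpy as np
import pandas as pd
from patsy import dmatrix

rng = np.random.RandomState(0)

# step 1: ~10000 x 5 dummy covariates, centered by subtracting the median
covariates = pd.DataFrame(rng.binomial(1, 0.3, size=(10000, 5)),
                          columns=["x1", "x2", "x3", "x4", "x5"])
covariates = covariates - covariates.median()

# step 2a: main effects plus all pairwise interactions; patsy adds an
# intercept column by default
design = dmatrix("(x1 + x2 + x3 + x4 + x5) ** 2", covariates,
                 return_type="dataframe")

# per-setting success/fail counts (synthetic here; mine come from the data)
success_fail = pd.DataFrame({"success": rng.binomial(30, 0.03, size=len(design)),
                             "fail": rng.binomial(30, 0.90, size=len(design))})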
On Dec 15, 2016, at 2:16 PM, Alexey Dral <[email protected]> wrote:
Could you try to normalize the dataset after the feature dummy encoding and see
if the behavior is still reproducible?
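Something along these lines, for example (toy data just to show what I mean;
substitute your own design matrix and labels):

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X_dummies = rng.binomial(1, 0.3, size=(200, 10)).astype(float)  # dummy-encoded features
y = rng.binomial(1, 0.05, size=200)                             # sparse successes
y[:5] = 1  # make sure both classes are present in the toy data

# standardize the columns *after* the dummy encoding, then refit
X_scaled = StandardScaler().fit_transform(X_dummies)
model = LogisticRegression(C=1, solver="lbfgs", fit_intercept=False)
model.fit(X_scaled, y)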
2016-12-15 22:03 GMT+03:00 Rachel Melamed <[email protected]>:
Thanks for the reply. The covariates (“X”) are all dummy/categorical variables.
So I guess no, nothing is normalized.
On Dec 15, 2016, at 1:54 PM, Alexey Dral <[email protected]> wrote:
Hi Rachel,
Do you have your data normalized?
2016-12-15 20:21 GMT+03:00 Rachel Melamed <[email protected]>:
Hi all,
Does anyone have any suggestions for this problem:
http://stackoverflow.com/questions/41125342/sklearn-logistic-regression-gives-biased-results
I am running around 1000 similar logistic regressions, with the same covariates
but slightly different data and response variables. All of my response
variables have sparse successes (usually p(success) < 0.05).
I noticed that with the regularized regression, the results are consistently
biased to predict more "successes" than are observed in the training data. When
I relax the regularization, this bias goes away. The observed bias is
unacceptable for my use case, even though the more-regularized model otherwise
seems a bit better.
Below, I plot the results for the 1000 different regressions for 2 different
values of C: [plot: results for the 1000 regressions at the 2 values of C]
<https://i.stack.imgur.com/1cbrC.png>
I looked at the parameter estimates for one of these regressions; in the plot
below, each point is one parameter. It seems like the intercept (the point on
the bottom left) is too high for the C=1 model. [plot: parameter estimates for
one regression] <https://i.stack.imgur.com/NTFOY.png>
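
For reference, this is roughly how I compare the fitted intercept and the mean
predicted success probability under the two settings (toy data below, with an
explicit intercept column in the design to mimic my setup):

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
n = 2000
design = np.hstack([np.ones((n, 1)), rng.binomial(1, 0.3, size=(n, 9))])
labels = rng.binomial(1, 0.05, size=n)
labels[:10] = 1  # make sure both classes appear in the toy data

for C in (1.0, 1e6):  # very large C ~ effectively unregularized
    model = LogisticRegression(C=C, solver="lbfgs", fit_intercept=False)
    model.fit(design, labels)
    print(C,
          "intercept coefficient:", model.coef_[0, 0],
          "mean predicted p(success):", model.predict_proba(design)[:, 1].mean())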
--
Yours sincerely,
Alexey A. Dral
_______________________________________________
scikit-learn mailing list
[email protected]
https://mail.python.org/mailman/listinfo/scikit-learn