LR is biased with imbalanced datasets. Is your dataset unbalanced? (e.g. is there one class that has a much smaller prevalence in the data that the other)?
On Thu, Dec 15, 2016 at 1:02 PM, Rachel Melamed <[email protected]> wrote: > I just tried it and it did not appear to change the results at all? > I ran it as follows: > 1) Normalize dummy variables (by subtracting median) to make a matrix of > about 10000 x 5 > > 2) For each of the 1000 output variables: > a. Each output variable uses the same dummy variables, but not all > settings of covariates are observed for all output variables. So I create > the design matrix using patsy per output variable to include pairwise > interactions. Then, I have an around 10000 x 350 design matrix , and a > matrix I call “success_fail” that has for each setting the number of > success and number of fail, so it is of size 10000 x 2 > > b. Run regression using: > skdesign = np.vstack((design,design)) > sklabel = np.hstack((np.ones(success_fail.shape[0]), > np.zeros(success_fail.shape[0]))) > skweight = np.hstack((success_fail['success'], success_fail['fail'])) > > logregN = linear_model.LogisticRegression(C=1, > solver= 'lbfgs',fit_intercept=False) > logregN.fit(skdesign, sklabel, sample_weight=skweight) > > > On Dec 15, 2016, at 2:16 PM, Alexey Dral <[email protected]> wrote: > > Could you try to normalize dataset after feature dummy encoding and see if > it is reproducible behavior? > > 2016-12-15 22:03 GMT+03:00 Rachel Melamed <[email protected]>: > >> Thanks for the reply. The covariates (“X") are all dummy/categorical >> variables. So I guess no, nothing is normalized. >> >> On Dec 15, 2016, at 1:54 PM, Alexey Dral <[email protected]> wrote: >> >> Hi Rachel, >> >> Do you have your data normalized? >> >> 2016-12-15 20:21 GMT+03:00 Rachel Melamed <[email protected]>: >> >>> Hi all, >>> Does anyone have any suggestions for this problem: >>> http://stackoverflow.com/questions/41125342/sklearn-logistic >>> -regression-gives-biased-results >>> >>> I am running around 1000 similar logistic regressions, with the same >>> covariates but slightly different data and response variables. All of my >>> response variables have a sparse successes (p(success) < .05 usually). >>> >>> I noticed that with the regularized regression, the results are >>> consistently biased to predict more "successes" than is observed in the >>> training data. When I relax the regularization, this bias goes away. The >>> bias observed is unacceptable for my use case, but the more-regularized >>> model does seem a bit better. >>> >>> Below, I plot the results for the 1000 different regressions for 2 >>> different values of C: [image: results for the different regressions >>> for 2 different values of C] <https://i.stack.imgur.com/1cbrC.png> >>> >>> I looked at the parameter estimates for one of these regressions: below >>> each point is one parameter. It seems like the intercept (the point on the >>> bottom left) is too high for the C=1 model. [image: enter image >>> description here] <https://i.stack.imgur.com/NTFOY.png> >>> >>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> [email protected] >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> >> -- >> Yours sincerely, >> Alexey A. Dral >> _______________________________________________ >> scikit-learn mailing list >> [email protected] >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> >> _______________________________________________ >> scikit-learn mailing list >> [email protected] >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > > -- > Yours sincerely, > Alexey A. Dral > _______________________________________________ > scikit-learn mailing list > [email protected] > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > [email protected] > https://mail.python.org/mailman/listinfo/scikit-learn > >
_______________________________________________ scikit-learn mailing list [email protected] https://mail.python.org/mailman/listinfo/scikit-learn
