Josef! Thank you! You are a gift to the Python statistics world. I don’t know why I did not think of that. That is the answer.

Rachel
On Dec 18, 2016, at 10:09 AM, josef.p...@gmail.com wrote:

On Sat, Dec 17, 2016 at 10:25 PM, Rachel Melamed <mela...@uchicago.edu> wrote:

Hi Sean, Sebastian, Alexey (and Josef),

I’m not sure I fully understand what normalizing a dummy should consist of, so please let me know if I am interpreting your suggestion correctly. I believe I can’t use the StandardScaler, since I am using grouped data. As I mentioned, I’m using a matrix “success_fail” containing the number of observations of each outcome per row of the design matrix. I tried to implement your suggestions by building my design matrix for each regression as before, using patsy, but then normalizing it:

import numpy as np

darray = np.array(design)
# weight each row by its share of the total observation count
weights = np.tile(success_fail.sum(axis=1) / success_fail.sum().sum(),
                  (darray.shape[1], 1)).transpose()
dmean = (darray * weights).sum(axis=0)                        # weighted column means
dvar = (((darray - dmean) ** 2) * weights).sum(axis=0) ** .5  # weighted column standard deviations
dvar[dvar == 0] = 1                                           # leave constant columns unscaled
design_norm = (darray - dmean) / dvar
design_norm[:, 0] = 1                                         # intercept column stays at 1

Then I use the normalized version of the design matrix as input to the regression. It seems like the bias is still there, though the results are a bit different (see plot). Is this what you were suggesting? I’m also not sure I understand why the error would be higher on the training data for the less-regularized setting.

Thanks again,
Rachel

Doing a partial check on the math, AFAICS: the estimating equation (gradient condition) for the intercept is unchanged even when the other parameters are penalized. This means that in the training sample the fraction of successes or failures should coincide with the predicted probabilities of success or failure. The intercept should compensate for the penalization of the other parameters and for any transformation or standardization of the other explanatory variables.

AFAICS from your code snippets, you are using the intercept as part of X with fit_intercept=False. This means your intercept is penalized. If you use the built-in fit_intercept=True, then the intercept is not penalized and the bias should go away.

Josef
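A minimal sketch of the fix Josef describes, with small hypothetical stand-ins for "design" and "success_fail" so it runs on its own: the constant column is dropped from X and scikit-learn fits the intercept itself, so with the lbfgs solver the intercept is not penalized.

import numpy as np
import pandas as pd
from sklearn import linear_model

# Hypothetical stand-ins for the objects in the thread: a design matrix whose
# first column is the patsy constant, plus per-row success/fail counts.
rng = np.random.RandomState(0)
design = np.hstack([np.ones((50, 1)), rng.randint(0, 2, size=(50, 4))])
success_fail = pd.DataFrame({'success': rng.randint(0, 5, size=50),
                             'fail': rng.randint(20, 100, size=50)})

# Drop the constant column and let LogisticRegression add the intercept:
# with solver='lbfgs' and fit_intercept=True the intercept is left unpenalized.
X = design[:, 1:]
skdesign = np.vstack((X, X))
sklabel = np.hstack((np.ones(len(success_fail)), np.zeros(len(success_fail))))
skweight = np.hstack((success_fail['success'], success_fail['fail']))

logreg = linear_model.LogisticRegression(C=1, solver='lbfgs', fit_intercept=True)
logreg.fit(skdesign, sklabel, sample_weight=skweight)
print(logreg.intercept_, logreg.coef_)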
On Dec 15, 2016, at 5:05 PM, Sean Violante <sean.viola...@gmail.com> wrote:

Sorry, I just saw that you are not using the liblinear solver. I agree with Sebastian: you should subtract the mean, not the median.

On 15 Dec 2016 11:02 pm, "Sean Violante" <sean.viola...@gmail.com> wrote:

The problem is the (stupid!) liblinear solver, which also penalises the intercept in the regularisation. Use a different solver or change the intercept_scaling parameter.

On 15 Dec 2016 10:44 pm, "Sebastian Raschka" <se.rasc...@gmail.com> wrote:

Subtracting the median wouldn’t result in normalizing in the usual sense, since subtracting a constant just shifts the values by a constant. Instead, for logistic regression and most optimizers, I would recommend subtracting the mean to center the features at zero and dividing by the standard deviation to get “z” scores (e.g., this can be done with the StandardScaler()).

Best,
Sebastian
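For reference, a minimal sketch of the plain (unweighted) z-scoring Sebastian describes, on a hypothetical 0/1 dummy matrix X; Rachel's weighted variant further up adapts the same idea to her grouped success/fail counts.

import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical dummy-encoded covariates.
X = np.random.RandomState(0).randint(0, 2, size=(100, 5)).astype(float)

scaler = StandardScaler()       # subtract each column's mean, divide by its standard deviation
X_z = scaler.fit_transform(X)   # the "z" scores Sebastian refers to
print(X_z.mean(axis=0).round(6), X_z.std(axis=0).round(6))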
On Dec 15, 2016, at 4:02 PM, Rachel Melamed <mela...@uchicago.edu> wrote:

I just tried it and it did not appear to change the results at all? I ran it as follows:

1) Normalize the dummy variables (by subtracting the median) to make a matrix of about 10000 x 5.

2) For each of the 1000 output variables:

a. Each output variable uses the same dummy variables, but not all settings of the covariates are observed for all output variables. So I create the design matrix per output variable using patsy, to include pairwise interactions. I then have an approximately 10000 x 350 design matrix, and a matrix I call “success_fail” that has, for each setting, the number of successes and the number of failures, so it is of size 10000 x 2.

b. Run the regression using:

import numpy as np
from sklearn import linear_model

skdesign = np.vstack((design, design))
sklabel = np.hstack((np.ones(success_fail.shape[0]),
                     np.zeros(success_fail.shape[0])))
skweight = np.hstack((success_fail['success'], success_fail['fail']))

logregN = linear_model.LogisticRegression(C=1,
                                          solver='lbfgs',
                                          fit_intercept=False)
logregN.fit(skdesign, sklabel, sample_weight=skweight)

On Dec 15, 2016, at 2:16 PM, Alexey Dral <aad...@gmail.com> wrote:

Could you try to normalize the dataset after the feature dummy encoding and see if the behavior is reproducible?

--
Yours sincerely,
Alexey A. Dral

2016-12-15 22:03 GMT+03:00 Rachel Melamed <mela...@uchicago.edu>:

Thanks for the reply. The covariates (“X”) are all dummy/categorical variables. So I guess no, nothing is normalized.

On Dec 15, 2016, at 1:54 PM, Alexey Dral <aad...@gmail.com> wrote:

Hi Rachel,

Do you have your data normalized?

--
Yours sincerely,
Alexey A. Dral

2016-12-15 20:21 GMT+03:00 Rachel Melamed <mela...@uchicago.edu>:

Hi all,

Does anyone have any suggestions for this problem:
http://stackoverflow.com/questions/41125342/sklearn-logistic-regression-gives-biased-results

I am running around 1000 similar logistic regressions, with the same covariates but slightly different data and response variables. All of my response variables have sparse successes (p(success) < .05 usually).

I noticed that with the regularized regression, the results are consistently biased to predict more "successes" than are observed in the training data. When I relax the regularization, this bias goes away. The observed bias is unacceptable for my use case, but the more-regularized model does seem a bit better.

Below, I plot the results for the 1000 different regressions for 2 different values of C.

I looked at the parameter estimates for one of these regressions: below, each point is one parameter. It seems like the intercept (the point on the bottom left) is too high for the C=1 model.
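As a rough check of Josef's point that an unpenalized intercept makes the training-sample success fraction match the average predicted probability, one could compare the two quantities on the weighted arrays above. A sketch, assuming the fitted estimator logregN and the skdesign/sklabel/skweight arrays from Rachel's snippet:

import numpy as np

# Weighted observed success rate vs. weighted mean predicted probability.
# With an unpenalized intercept (fit_intercept=True) these should agree closely
# on the training data; with the penalized-intercept setup they can drift apart,
# which is the bias described above.
pred = logregN.predict_proba(skdesign)[:, 1]        # P(success) for each (duplicated) row
obs_rate = np.average(sklabel, weights=skweight)    # observed fraction of successes
pred_rate = np.average(pred, weights=skweight)      # average predicted probability
print("observed: %.4f  predicted: %.4f" % (obs_rate, pred_rate))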
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn