In other words, you have an ill-conditioned estimation problem, and what you were seeing were numerical instabilities due to this ill-conditioning.
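To make the ill-conditioning concrete, here is a minimal sketch (the matrix X is hypothetical, built to mimic the wide-ranging columns described in the thread below) showing how the condition number of the design matrix blows up when column scales differ wildly, and how standardizing brings it back down:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical design matrix: two columns on wildly different scales,
# mimicking the "huge range of values" columns described below.
rng = np.random.RandomState(0)
X = np.column_stack([rng.rand(240) * 1e6,    # large-scale column
                     rng.rand(240) * 1e-2])  # small-scale column

print(np.linalg.cond(X))       # huge condition number: ill-conditioned
X_std = StandardScaler().fit_transform(X)
print(np.linalg.cond(X_std))   # orders of magnitude smaller after scaling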
Not a bug. Expected behavior.

Sent from my phone. Please forgive brevity and misspelling.

On Aug 16, 2016, at 18:17, Chris Cameron <[email protected]> wrote:
>Thank you everyone for your help. The short version of this email is
>that changing the solver from ‘liblinear’ to ‘sag’ fixed my problem -
>but only if I upped “max_iter” to 1000.
>
>Longer version -
>Without max_iter=1000, I would get the warning:
>ConvergenceWarning: The max_iter was reached which means the coef_ did
>not converge
>
>I have some columns in my data that have a huge range of values. Using
>“liblinear”, if I transformed those columns, causing the range to be
>smaller, the results would be consistent every time.
>
>This is the function I ended up using -
>
>def log_run(logreg_x, logreg_y):
>    logreg_x['pass_fail'] = logreg_y
>    df_train, df_test, y_train, y_test = train_test_split(logreg_x, logreg_y, random_state=0)
>    del(df_train['pass_fail'])
>    del(df_test['pass_fail'])
>    log_reg_fit = LogisticRegression(class_weight='balanced',
>                                     tol=0.00000001,
>                                     random_state=8,
>                                     solver='sag',
>                                     max_iter=1000).fit(df_train.values, y_train)
>    predicted = log_reg_fit.predict(df_test.values)
>    accuracy = accuracy_score(y_test, predicted)
>    kappa = cohen_kappa_score(y_test, predicted)
>
>    return [kappa, accuracy]
>
>Thank you again for the help,
>
>Chris
>
>> On Aug 15, 2016, at 4:26 PM, [email protected] wrote:
>>
>> Hm, was worth a try. What happens if you change the solver to
>> something other than liblinear, does this issue still persist?
>>
>> Btw. scikit-learn works with NumPy arrays, not NumPy matrices.
>> Probably unrelated to your issue, I’d recommend setting
>>
>>> y_train = df_train.pass_fail.values
>>> y_test = df_test.pass_fail.values
>>
>> instead of
>>
>>> y_train = df_train.pass_fail.as_matrix()
>>> y_test = df_test.pass_fail.as_matrix()
>>
>> Also, try passing NumPy arrays to the fit method:
>>
>>> log_reg_fit = LogisticRegression(...).fit(df_train.values, y_train)
>>
>> and
>>
>>> predicted = log_reg_fit.predict(df_test.values)
>>
>> and so forth.
>>
>>> On Aug 15, 2016, at 6:00 PM, Chris Cameron <[email protected]> wrote:
>>>
>>> Sebastian,
>>>
>>> That doesn’t do it. With the function:
>>>
>>> def log_run(logreg_x, logreg_y):
>>>     logreg_x['pass_fail'] = logreg_y
>>>     df_train, df_test = train_test_split(logreg_x, random_state=0)
>>>     y_train = df_train.pass_fail.as_matrix()
>>>     y_test = df_test.pass_fail.as_matrix()
>>>     del(df_train['pass_fail'])
>>>     del(df_test['pass_fail'])
>>>     log_reg_fit = LogisticRegression(class_weight='balanced',
>>>                                      tol=0.000000001,
>>>                                      random_state=0).fit(df_train, y_train)
>>>     predicted = log_reg_fit.predict(df_test)
>>>     accuracy = accuracy_score(y_test, predicted)
>>>     kappa = cohen_kappa_score(y_test, predicted)
>>>
>>>     return [kappa, accuracy]
>>>
>>> I’m still seeing:
>>>
>>> log_run(df_save, y)
>>> Out[7]: [-0.054421768707483005, 0.48333333333333334]
>>>
>>> log_run(df_save, y)
>>> Out[8]: [0.042553191489361743, 0.55000000000000004]
>>>
>>> log_run(df_save, y)
>>> Out[9]: [0.042553191489361743, 0.55000000000000004]
>>>
>>> log_run(df_save, y)
>>> Out[10]: [0.027777777777777728, 0.53333333333333333]
>>>
>>> Chris
>>>
>>>> On Aug 15, 2016, at 3:42 PM, [email protected] wrote:
>>>>
>>>> Hi, Chris,
>>>>
>>>> have you set the random seed to a specific, constant integer value?
>>>> Note that the default in LogisticRegression is random_state=None.
>>>> Setting it to some arbitrary number like 123 may help if you
>>>> haven’t done so, yet.
>>>>
>>>> Best,
>>>> Sebastian
>>>>
>>>>> On Aug 15, 2016, at 5:27 PM, Chris Cameron <[email protected]> wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> Using the same X and y values,
>>>>> sklearn.linear_model.LogisticRegression.fit() is providing me with
>>>>> inconsistent results.
>>>>>
>>>>> The documentation for sklearn.linear_model.LogisticRegression
>>>>> states that "It is thus not uncommon, to have slightly different
>>>>> results for the same input data.” I am experiencing this; however,
>>>>> the fix of using a smaller “tol” parameter isn’t giving me a
>>>>> consistent fit.
>>>>>
>>>>> The code I’m using:
>>>>>
>>>>> def log_run(logreg_x, logreg_y):
>>>>>     logreg_x['pass_fail'] = logreg_y
>>>>>     df_train, df_test = train_test_split(logreg_x, random_state=0)
>>>>>     y_train = df_train.pass_fail.as_matrix()
>>>>>     y_test = df_test.pass_fail.as_matrix()
>>>>>     del(df_train['pass_fail'])
>>>>>     del(df_test['pass_fail'])
>>>>>     log_reg_fit = LogisticRegression(class_weight='balanced',
>>>>>                                      tol=0.000000001).fit(df_train, y_train)
>>>>>     predicted = log_reg_fit.predict(df_test)
>>>>>     accuracy = accuracy_score(y_test, predicted)
>>>>>     kappa = cohen_kappa_score(y_test, predicted)
>>>>>
>>>>>     return [kappa, accuracy]
>>>>>
>>>>> I’ve gone out of my way to be sure the test and train data are the
>>>>> same for each run, so I don’t think there should be any random
>>>>> shuffling going on.
>>>>>
>>>>> Example output:
>>>>> ---
>>>>> log_run(df_save, y)
>>>>> Out[32]: [0.027777777777777728, 0.53333333333333333]
>>>>>
>>>>> log_run(df_save, y)
>>>>> Out[33]: [0.027777777777777728, 0.53333333333333333]
>>>>>
>>>>> log_run(df_save, y)
>>>>> Out[34]: [0.11347517730496456, 0.58333333333333337]
>>>>>
>>>>> log_run(df_save, y)
>>>>> Out[35]: [0.042553191489361743, 0.55000000000000004]
>>>>>
>>>>> log_run(df_save, y)
>>>>> Out[36]: [-0.07407407407407407, 0.51666666666666672]
>>>>>
>>>>> log_run(df_save, y)
>>>>> Out[37]: [0.042553191489361743, 0.55000000000000004]
>>>>>
>>>>> A little information on the problem DataFrame:
>>>>> ---
>>>>> len(df_save)
>>>>> Out[40]: 240
>>>>>
>>>>> len(df_save.columns)
>>>>> Out[41]: 18
>>>>>
>>>>> If I omit this particular column, the Kappa no longer fluctuates:
>>>>>
>>>>> df_save['abc'].head()
>>>>> Out[42]:
>>>>> 0    0.026316
>>>>> 1    0.333333
>>>>> 2    0.015152
>>>>> 3    0.010526
>>>>> 4    0.125000
>>>>> Name: abc, dtype: float64
>>>>>
>>>>> Does anyone have ideas on how I can figure this out? Is there some
>>>>> randomness/shuffling still going on that I missed?
>>>>>
>>>>> Thanks!
>>>>> Chris
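A note on the fix: raising max_iter lets ‘sag’ grind to convergence, but the root cause is the wide-ranging columns Chris mentions; he observed himself that shrinking their range made the results consistent every time. A minimal sketch of folding that rescaling into the model (assuming the df_train, df_test, y_train, y_test variables from the thread; StandardScaler is one possible transform, not necessarily the one Chris used):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Standardizing the features first tames the ill-conditioning, so the
# solver should converge to the same coefficients on every run.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight='balanced', random_state=8))
model.fit(df_train.values, y_train)        # df_train/y_train as in the thread
predicted = model.predict(df_test.values)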
_______________________________________________
scikit-learn mailing list
[email protected]
https://mail.python.org/mailman/listinfo/scikit-learn
