Hi Chris,

have you set the random seed to a specific, constant integer value? Note that the default in LogisticRegression is random_state=None. Setting it to some arbitrary number like 123 may help if you haven't done so yet.
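For example, something along these lines (a minimal sketch on synthetic data from make_classification, not your actual DataFrame; class_weight and tol just mirror your snippet) should give identical coefficients on every repetition once random_state is pinned:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy data for illustration only -- same shape as your DataFrame (240 x 18).
X, y = make_classification(n_samples=240, n_features=18, random_state=0)

# With a fixed random_state, repeated fits on the same data are reproducible.
fits = [
    LogisticRegression(class_weight='balanced', tol=1e-9, random_state=123).fit(X, y)
    for _ in range(3)
]
print([f.coef_[0][:3] for f in fits])  # identical values across the three fits

If I remember correctly, the default liblinear solver shuffles the data internally as part of its coordinate descent, so with random_state=None that shuffling (rather than train/test splitting) is a likely source of the run-to-run variation you are seeing.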
Best,
Sebastian

> On Aug 15, 2016, at 5:27 PM, Chris Cameron <[email protected]> wrote:
>
> Hi all,
>
> Using the same X and y values sklearn.linear_model.LogisticRegression.fit()
> is providing me with inconsistent results.
>
> The documentation for sklearn.linear_model.LogisticRegression states that "It
> is thus not uncommon, to have slightly different results for the same input
> data." I am experiencing this, however the fix of using a smaller "tol"
> parameter isn't providing me with consistent fit.
>
> The code I'm using:
>
> def log_run(logreg_x, logreg_y):
>     logreg_x['pass_fail'] = logreg_y
>     df_train, df_test = train_test_split(logreg_x, random_state=0)
>     y_train = df_train.pass_fail.as_matrix()
>     y_test = df_test.pass_fail.as_matrix()
>     del(df_train['pass_fail'])
>     del(df_test['pass_fail'])
>     log_reg_fit = LogisticRegression(class_weight='balanced', tol=0.000000001).fit(df_train, y_train)
>     predicted = log_reg_fit.predict(df_test)
>     accuracy = accuracy_score(y_test, predicted)
>     kappa = cohen_kappa_score(y_test, predicted)
>
>     return [kappa, accuracy]
>
> I've gone out of my way to be sure the test and train data is the same for
> each run, so I don't think there should be random shuffling going on.
>
> Example output:
> ---
> log_run(df_save, y)
> Out[32]: [0.027777777777777728, 0.53333333333333333]
>
> log_run(df_save, y)
> Out[33]: [0.027777777777777728, 0.53333333333333333]
>
> log_run(df_save, y)
> Out[34]: [0.11347517730496456, 0.58333333333333337]
>
> log_run(df_save, y)
> Out[35]: [0.042553191489361743, 0.55000000000000004]
>
> log_run(df_save, y)
> Out[36]: [-0.07407407407407407, 0.51666666666666672]
>
> log_run(df_save, y)
> Out[37]: [0.042553191489361743, 0.55000000000000004]
>
> A little information on the problem DataFrame:
> ---
> len(df_save)
> Out[40]: 240
>
> len(df_save.columns)
> Out[41]: 18
>
> If I omit this particular column the Kappa no longer fluctuates:
>
> df_save['abc'].head()
> Out[42]:
> 0    0.026316
> 1    0.333333
> 2    0.015152
> 3    0.010526
> 4    0.125000
> Name: abc, dtype: float64
>
> Does anyone have ideas on how I can figure this out? Is there some
> randomness/shuffling still going on I missed?
>
> Thanks!
> Chris

_______________________________________________
scikit-learn mailing list
[email protected]
https://mail.python.org/mailman/listinfo/scikit-learn
