Hm, it was worth a try. What happens if you change the solver to something other than liblinear? Does the issue still persist?
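For example, a minimal sketch based on your log_run function ('newton-cg', 'lbfgs', and 'sag' are the alternatives to the default 'liblinear'):

> # same call as in log_run, with only the solver swapped out
> log_reg_fit = LogisticRegression(class_weight='balanced',
>                                  solver='lbfgs',
>                                  tol=0.000000001,
>                                  random_state=0).fit(df_train, y_train)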
Btw. scikit-learn works with NumPy arrays, not NumPy matrices. Probably unrelated to your issue, but I’d recommend setting

> y_train = df_train.pass_fail.values
> y_test = df_test.pass_fail.values

instead of

> y_train = df_train.pass_fail.as_matrix()
> y_test = df_test.pass_fail.as_matrix()

Also, try passing NumPy arrays to the fit method:

> log_reg_fit = LogisticRegression(...).fit(df_train.values, y_train)

and

> predicted = log_reg_fit.predict(df_test.values)

and so forth. (A self-contained sketch is at the bottom of this message.)

> On Aug 15, 2016, at 6:00 PM, Chris Cameron <[email protected]> wrote:
>
> Sebastian,
>
> That doesn’t do it. With the function:
>
> def log_run(logreg_x, logreg_y):
>     logreg_x['pass_fail'] = logreg_y
>     df_train, df_test = train_test_split(logreg_x, random_state=0)
>     y_train = df_train.pass_fail.as_matrix()
>     y_test = df_test.pass_fail.as_matrix()
>     del(df_train['pass_fail'])
>     del(df_test['pass_fail'])
>     log_reg_fit = LogisticRegression(class_weight='balanced',
>                                      tol=0.000000001,
>                                      random_state=0).fit(df_train, y_train)
>     predicted = log_reg_fit.predict(df_test)
>     accuracy = accuracy_score(y_test, predicted)
>     kappa = cohen_kappa_score(y_test, predicted)
>
>     return [kappa, accuracy]
>
> I’m still seeing:
>
> log_run(df_save, y)
> Out[7]: [-0.054421768707483005, 0.48333333333333334]
>
> log_run(df_save, y)
> Out[8]: [0.042553191489361743, 0.55000000000000004]
>
> log_run(df_save, y)
> Out[9]: [0.042553191489361743, 0.55000000000000004]
>
> log_run(df_save, y)
> Out[10]: [0.027777777777777728, 0.53333333333333333]
>
> Chris
>
>> On Aug 15, 2016, at 3:42 PM, [email protected] wrote:
>>
>> Hi, Chris,
>>
>> Have you set the random seed to a specific, constant integer value? Note that
>> the default in LogisticRegression is random_state=None. Setting it to some
>> arbitrary number like 123 may help if you haven’t done so yet.
>>
>> Best,
>> Sebastian
>>
>>> On Aug 15, 2016, at 5:27 PM, Chris Cameron <[email protected]> wrote:
>>>
>>> Hi all,
>>>
>>> Using the same X and y values, sklearn.linear_model.LogisticRegression.fit()
>>> is providing me with inconsistent results.
>>>
>>> The documentation for sklearn.linear_model.LogisticRegression states that
>>> “It is thus not uncommon, to have slightly different results for the same
>>> input data.” I am experiencing this; however, the fix of using a smaller
>>> “tol” parameter isn’t providing me with a consistent fit.
>>>
>>> The code I’m using:
>>>
>>> def log_run(logreg_x, logreg_y):
>>>     logreg_x['pass_fail'] = logreg_y
>>>     df_train, df_test = train_test_split(logreg_x, random_state=0)
>>>     y_train = df_train.pass_fail.as_matrix()
>>>     y_test = df_test.pass_fail.as_matrix()
>>>     del(df_train['pass_fail'])
>>>     del(df_test['pass_fail'])
>>>     log_reg_fit = LogisticRegression(class_weight='balanced',
>>>                                      tol=0.000000001).fit(df_train, y_train)
>>>     predicted = log_reg_fit.predict(df_test)
>>>     accuracy = accuracy_score(y_test, predicted)
>>>     kappa = cohen_kappa_score(y_test, predicted)
>>>
>>>     return [kappa, accuracy]
>>>
>>> I’ve gone out of my way to be sure the test and train data are the same for
>>> each run, so I don’t think there should be any random shuffling going on.
>>>
>>> Example output:
>>> ---
>>> log_run(df_save, y)
>>> Out[32]: [0.027777777777777728, 0.53333333333333333]
>>>
>>> log_run(df_save, y)
>>> Out[33]: [0.027777777777777728, 0.53333333333333333]
>>>
>>> log_run(df_save, y)
>>> Out[34]: [0.11347517730496456, 0.58333333333333337]
>>>
>>> log_run(df_save, y)
>>> Out[35]: [0.042553191489361743, 0.55000000000000004]
>>>
>>> log_run(df_save, y)
>>> Out[36]: [-0.07407407407407407, 0.51666666666666672]
>>>
>>> log_run(df_save, y)
>>> Out[37]: [0.042553191489361743, 0.55000000000000004]
>>>
>>> A little information on the problem DataFrame:
>>> ---
>>> len(df_save)
>>> Out[40]: 240
>>>
>>> len(df_save.columns)
>>> Out[41]: 18
>>>
>>> If I omit this particular column, the kappa no longer fluctuates:
>>>
>>> df_save['abc'].head()
>>> Out[42]:
>>> 0    0.026316
>>> 1    0.333333
>>> 2    0.015152
>>> 3    0.010526
>>> 4    0.125000
>>> Name: abc, dtype: float64
>>>
>>> Does anyone have ideas on how I can figure this out? Is there some
>>> randomness/shuffling still going on that I missed?
>>>
>>> Thanks!
>>> Chris
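P.S. In case it helps, here is a self-contained sketch showing that with a fixed random_state and plain NumPy arrays the fit should be exactly reproducible. This uses synthetic data only; the array shapes mirror your df_save, everything else is made up:

> import numpy as np
> from sklearn.linear_model import LogisticRegression
>
> rng = np.random.RandomState(0)
> X = rng.rand(240, 18)          # same shape as df_save in the thread
> y = rng.randint(0, 2, 240)     # synthetic binary pass/fail labels
>
> # Two fits on identical data with a fixed random_state; the learned
> # coefficients should agree to within floating-point tolerance.
> clf_a = LogisticRegression(class_weight='balanced', tol=0.000000001,
>                            random_state=0).fit(X, y)
> clf_b = LogisticRegression(class_weight='balanced', tol=0.000000001,
>                            random_state=0).fit(X, y)
> print(np.allclose(clf_a.coef_, clf_b.coef_))   # expect True

If the coefficients still differ on your real data with random_state pinned, that would point at the solver rather than the data pipeline.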
