So it appears that the test check_classifiers_train() (https://github.com/scikit-learn/scikit-learn/blob/ef5cb84a/sklearn/utils/estimator_checks.py#L1079) does *not* use the iris dataset after all:
    X_m, y_m = make_blobs(n_samples=300, random_state=0)
    X_m, y_m = shuffle(X_m, y_m, random_state=7)
    X_m = StandardScaler().fit_transform(X_m)

But this also explains why my classifier gets an accuracy of only 31%. The classifier I'm trying to build to contribute to scikit-learn-contrib is designed to be used on NLP data where the features are *non-negative* counts: https://nlp.stanford.edu/pubs/sidaw12_simple_sentiment.pdf

Interestingly enough, this classifier reports 100% accuracy on the iris dataset (when the last 10% is used for testing). But again, the main purpose of this classifier is NLP use cases.

So @andreas mentioned that this check can be relaxed "if there's a good reason." Does the above situation qualify?

-M

On Thu, Oct 12, 2017 at 11:27 AM, Michael Capizzi <mcapi...@email.arizona.edu> wrote:

> Thanks @andreas for your comments, especially the info that it's the
> `iris` dataset. I'll have to dig a bit deeper to see what's going on with
> the performance there. But now that I know it's `iris`, I can try to
> recreate it.
>
> -M
>
> On Thu, Oct 12, 2017 at 12:01 AM, Andreas Mueller <t3k...@gmail.com> wrote:
>
>> Yes, it's pretty empirical, and with the estimator tags PR
>> (https://github.com/scikit-learn/scikit-learn/pull/8022) we will be able
>> to relax it if there's a good reason you're not passing.
>> But the dataset is pretty trivial (iris), and you're getting chance
>> performance (it's a balanced three-class problem). So that is not a great
>> sign for your estimator.
>>
>> On 10/11/2017 07:09 PM, Guillaume Lemaître wrote:
>>
>> Not 100% sure, but this is an integration/sanity check, since all
>> classifiers are supposed to predict quite well on the data they were
>> trained on. It's true that 83% is empirical, but it lets us spot changes
>> in the algorithms even when the unit tests still pass for some reason.
>>
>> On 11 October 2017 at 18:52, Michael Capizzi <mcapi...@email.arizona.edu> wrote:
>>
>>> I'm wondering if anyone can identify the purpose of this test:
>>> check_classifiers_train(), specifically this line:
>>> https://github.com/scikit-learn/scikit-learn/blob/ef5cb84a/sklearn/utils/estimator_checks.py#L1106
>>>
>>> My custom classifier (which I'm hoping to submit to scikit-learn-contrib)
>>> is failing this test:
>>>
>>> File "/Users/mcapizzi/miniconda3/envs/nb_plus_svm/lib/python3.6/site-packages/sklearn/utils/estimator_checks.py",
>>> line 1106, in check_classifiers_train
>>>     assert_greater(accuracy_score(y, y_pred), 0.83)
>>> AssertionError: 0.31333333333333335 not greater than 0.83
>>>
>>> And while it's disturbing that my classifier is getting 31% accuracy
>>> when, clearly, the test writer expects it to be in the upper 80s, I'm not
>>> sure I understand why that would be a test condition.
>>>
>>> Thanks for any insight.
>>
>>
>> --
>> Guillaume Lemaitre
>> INRIA Saclay - Parietal team
>> Center for Data Science Paris-Saclay
>> https://glemaitre.github.io/
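In case it helps anyone reproduce this locally, here is a minimal, self-contained sketch of the data setup that check_classifiers_train() performs (the same three lines quoted at the top, plus the imports they need). The final print is my own addition, not part of scikit-learn's check; it shows that after standardization roughly half of the feature values are negative, which is exactly what a model assuming non-negative counts cannot handle:

    # Rebuild the training data used by check_classifiers_train().
    from sklearn.datasets import make_blobs
    from sklearn.preprocessing import StandardScaler
    from sklearn.utils import shuffle

    X_m, y_m = make_blobs(n_samples=300, random_state=0)
    X_m, y_m = shuffle(X_m, y_m, random_state=7)
    X_m = StandardScaler().fit_transform(X_m)

    # StandardScaler centers each feature to zero mean, so roughly half
    # of the entries end up below zero -- bad news for count-based models.
    print("fraction of negative feature values:", (X_m < 0).mean())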
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn