So it appears that the test check_classifiers_train()
(https://github.com/scikit-learn/scikit-learn/blob/ef5cb84a/sklearn/utils/estimator_checks.py#L1079)
does *not* use the iris dataset after all:

from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.utils import shuffle

X_m, y_m = make_blobs(n_samples=300, random_state=0)
X_m, y_m = shuffle(X_m, y_m, random_state=7)
X_m = StandardScaler().fit_transform(X_m)

But this also explains why my classifier gets an accuracy of only 31%.
The classifier I'm building to contribute to scikit-learn-contrib is
designed for NLP data where the features are *non-negative*
counts: https://nlp.stanford.edu/pubs/sidaw12_simple_sentiment.pdf
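
For the record, here's a minimal sketch of why that blob data trips up a
count-based model. MultinomialNB stands in for my classifier (which isn't
public yet), since it makes the same non-negativity assumption:

from sklearn.datasets import make_blobs
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import StandardScaler

X_m, y_m = make_blobs(n_samples=300, random_state=0)
X_m = StandardScaler().fit_transform(X_m)  # centers each feature at mean 0

print((X_m < 0).any())         # True: about half the values are negative
MultinomialNB().fit(X_m, y_m)  # raises ValueError on the negative values

MultinomialNB refuses negative input outright; mine doesn't raise, it just
falls to roughly chance performance, which matches the 31%.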

Interestingly enough, this classifier reports 100% accuracy on the iris
dataset (when the last 10% is held out for testing). But again, the main
purpose of this classifier is NLP use cases.
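
In case anyone wants to reproduce, here's roughly the evaluation, again
with MultinomialNB as a stand-in for my classifier. Note that load_iris
returns samples sorted by class, so I shuffle before holding out the
last 10%:

from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.utils import shuffle

X, y = load_iris(return_X_y=True)
X, y = shuffle(X, y, random_state=0)  # iris is ordered by class
n_test = len(X) // 10                 # last 10% held out for testing

clf = MultinomialNB().fit(X[:-n_test], y[:-n_test])
print(accuracy_score(y[-n_test:], clf.predict(X[-n_test:])))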

So @andreas mentioned that this can be relaxed “if there’s a good reason.”
Does the above situation qualify?

-M

On Thu, Oct 12, 2017 at 11:27 AM, Michael Capizzi <
mcapi...@email.arizona.edu> wrote:

> Thanks @andreas for your comments, especially the info that it's the
> `iris` dataset.  I'll have to dig a bit deeper to see what's going on with
> the performance there.  But now that I know it's `iris`, I can try to
> recreate it.
>
> -M
>
> On Thu, Oct 12, 2017 at 12:01 AM, Andreas Mueller <t3k...@gmail.com>
> wrote:
>
>> Yes, it's pretty empirical, and with the estimator tags PR (
>> https://github.com/scikit-learn/scikit-learn/pull/8022) we will be able
>> to relax it if there's a good reason you're not passing.
>> But the dataset is pretty trivial (iris), and you're getting chance
>> performance (it's a balanced three-class problem). So that is not a
>> great sign for your estimator.
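>>
>> Once tags are in, opting out might look something like this (a
>> hypothetical sketch based on the direction of that PR; the final API
>> may well differ):
>>
>> from sklearn.base import BaseEstimator, ClassifierMixin
>>
>> class CountClassifier(BaseEstimator, ClassifierMixin):
>>     """Stub standing in for an estimator with domain-specific data needs."""
>>
>>     def _more_tags(self):
>>         # hypothetical tag asking the common tests not to hold this
>>         # estimator to the 0.83 accuracy bar on the shared toy data
>>         return {'poor_score': True}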
>>
>>
>> On 10/11/2017 07:09 PM, Guillaume Lemaître wrote:
>>
>> Not 100% sure, but this is an integration/sanity check, since all
>> classifiers are supposed to predict quite well on the data they were
>> trained on. It's true that the 83% threshold is empirical, but it lets
>> us spot changes in the algorithms even when the unit tests still pass
>> for some reason.
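>>
>> For context, these common checks are all driven by check_estimator; you
>> can run the whole suite yourself, shown here against LogisticRegression
>> (substitute your own estimator class):
>>
>> from sklearn.linear_model import LogisticRegression
>> from sklearn.utils.estimator_checks import check_estimator
>>
>> # runs check_classifiers_train among many other common checks
>> check_estimator(LogisticRegression)
>>
>> A failing assertion, like the 0.83 one, surfaces as the AssertionError
>> quoted below.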
>>
>> On 11 October 2017 at 18:52, Michael Capizzi <mcapi...@email.arizona.edu>
>> wrote:
>>
>>> I’m wondering if anyone can identify the purpose of this test:
>>> check_classifiers_train(), specifically this line:
>>> https://github.com/scikit-learn/scikit-learn/blob/ef5cb84a/sklearn/utils/estimator_checks.py#L1106
>>>
>>> My custom classifier (which I’m hoping to submit to scikit-learn-contrib)
>>> is failing this test:
>>>
>>>   File 
>>> "/Users/mcapizzi/miniconda3/envs/nb_plus_svm/lib/python3.6/site-packages/sklearn/utils/estimator_checks.py",
>>>  line 1106, in check_classifiers_train
>>>     assert_greater(accuracy_score(y, y_pred), 0.83)
>>> AssertionError: 0.31333333333333335 not greater than 0.83
>>>
>>> And while it's disturbing that my classifier is getting 31% accuracy
>>> when the test writer clearly expects it to be above 83%, I'm not sure
>>> I understand why that would be a test condition.
>>>
>>> Thanks for any insight.
>>>
>>
>>
>> --
>> Guillaume Lemaitre
>> INRIA Saclay - Parietal team
>> Center for Data Science Paris-Saclay
>> https://glemaitre.github.io/
>>
>>
>