Hello everyone,
I have spotted some strange behaviour while generating stratified
shuffled splits. For certain values of the `test_size` parameter and
for imbalanced classes, some samples might get ignored and the
training indices might not include one class.
For example, using sklearn 0.11 and Python 2.7:
    from sklearn.cross_validation import StratifiedShuffleSplit
    import numpy as np

    y = np.hstack(([-1] * 800, [1] * 50))
    tr_idx, te_idx = iter(StratifiedShuffleSplit(y, 1, test_size=0.3)).next()

    print np.unique(y[tr_idx])
    # Prints [-1]. I don't get any sample from class `1`.

    print len(tr_idx) + len(te_idx)
    # Prints 808. Some samples are lost.

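Both symptoms (a class missing from the training indices, and samples that end up in neither set) can be checked with a small helper. This is only a sketch: `check_split` is a hypothetical name, not part of sklearn, and the index arrays used below are hand-made to mimic the shape of the result above, not actual output of `StratifiedShuffleSplit`.

```python
import numpy as np


def check_split(y, tr_idx, te_idx):
    """Summarise which classes and samples a train/test split fails to cover."""
    y = np.asarray(y)
    tr_idx = np.asarray(tr_idx, dtype=int)
    te_idx = np.asarray(te_idx, dtype=int)
    classes = np.unique(y)
    used = np.union1d(tr_idx, te_idx)
    return {
        # classes present in y but absent from each side of the split
        "missing_in_train": np.setdiff1d(classes, np.unique(y[tr_idx])).tolist(),
        "missing_in_test": np.setdiff1d(classes, np.unique(y[te_idx])).tolist(),
        # samples that appear in neither the training nor the test indices
        "n_lost": int(len(y) - len(used)),
        # samples that appear on both sides (should be 0)
        "n_overlap": int(len(np.intersect1d(tr_idx, te_idx))),
    }


y = np.hstack(([-1] * 800, [1] * 50))

# Hand-made split for illustration only (assumption, not sklearn output):
# 600 training indices that are all class -1, and 208 test indices.
tr_idx = np.arange(0, 600)
te_idx = np.arange(600, 808)

report = check_split(y, tr_idx, te_idx)
```

Running the same helper on the indices returned by the real splitter should show whether the problem is reproduced for a given `test_size`.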
Part of the problem seems to be caused by rounding. If 1. / test_size is
an integer (e.g., test_size=0.25), everything works fine.
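For reference, 1. / 0.3 is roughly 3.33, whereas 1. / 0.25 is exactly 4. One way to avoid depending on that ratio is to round the test count separately inside each class. The following is a minimal NumPy-only sketch of that idea; `stratified_split` and its `seed` parameter are hypothetical names chosen for illustration, and this is not how sklearn implements `StratifiedShuffleSplit`.

```python
import numpy as np


def stratified_split(y, test_size, seed=0):
    """Split indices so each class contributes about test_size of its samples to the test set."""
    y = np.asarray(y)
    rng = np.random.RandomState(seed)
    train, test = [], []
    for label in np.unique(y):
        # shuffle the indices belonging to this class
        idx = rng.permutation(np.flatnonzero(y == label))
        # round per class, so no sample is dropped by a global rounding step
        n_test = int(round(test_size * len(idx)))
        # keep at least one sample of the class on each side when possible
        n_test = min(max(n_test, 1), len(idx) - 1)
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return np.array(train), np.array(test)


y = np.hstack(([-1] * 800, [1] * 50))
tr_idx, te_idx = stratified_split(y, test_size=0.3)
```

With this approach every index lands in exactly one of the two sets, and each class with at least two samples appears on both sides.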
Dan.