Hi Maciek,

Thanks for the suggestion. I think the problem is indeed related to StratifiedKFold, because the code works fine if I use KFold instead. However, when I print the StratifiedKFold object, it looks fine to me:

sklearn.cross_validation.StratifiedKFold(labels=[
  5.43 8.74 8.1 6.55 7.66 6.52 8.6 7.1 6.4 8.05 7.89 6.68
  8.06 6.17 5.5 7.96 5.78 6. 7.74 5.83 6.51 6.31 6.68 9.22
  6.07 7.06 7.12 8.64 5.72 6.4 7.64 5.74 7.41 6.49 6.81 7.1
  7.66 6.68 7.05 6.28 5.49 6.35 6.9 6.2 7.51 5.65 9.3 5.84
  6.92 5.75 6.92 8.8 7.04 5.81 5.73 5.31 7.13 7.66 6.98 5.93
  8.24 6.96 8.22 7.27 7.34 5.91 5.57 6.5 7.28 6.74 4.92 6.88
  5.8 9.15 6.63 6.37 8.66 6.4 ], n_folds=5, shuffle=False, random_state=None)

On Fri, Jul 8, 2016 at 10:42 PM, Maciek Wójcikowski <[email protected]> wrote:
> Hi Michał,
>
> What are the class counts in that set? Maybe there is a problem with
> generating stratified subsamples (e.g. some classes get below 1 sample)?
>
> ----
> Pozdrawiam, | Best regards,
> Maciek Wójcikowski
> [email protected]
>
> 2016-07-08 17:22 GMT+02:00 Michał Nowotka <[email protected]>:
>>
>> Hi,
>>
>> Sorry for cross-posting
>> (http://stackoverflow.com/questions/38263933/scikit-learn-gridsearchcv-fit-method-valueerror-found-array-with-0-sample)
>> but I don't know where it is better to get help with my problem.
>> I'm working on a VM with a Jupyter notebook server installed.
>> From time to time I add new notebooks and re-evaluate old ones to see
>> if they still work.
>>
>> This notebook stopped working because of some changes in the
>> scikit-learn API, and some parameters became obsolete:
>>
>> https://github.com/chembl/mychembl/blob/master/ipython_notebooks/10_myChEMBL_machine_learning.ipynb
>>
>> I've created a corrected version of the notebook here:
>>
>> https://gist.github.com/anonymous/676c55cc501ffa48fecfcc1e1252d433
>>
>> But I'm stuck in cell 36 on this code:
>>
>> from sklearn import cross_validation
>> from sklearn.cross_validation import KFold, StratifiedKFold
>> from sklearn.grid_search import GridSearchCV
>>
>> # x, y and custom_forest are assumed to be defined in earlier cells
>> X_traina, X_testa, y_traina, y_testa = cross_validation.train_test_split(
>>     x, y, test_size=0.95, random_state=23)
>>
>> params = {'min_samples_split': [8], 'max_depth': [20],
>>           'min_samples_leaf': [1], 'n_estimators': [200]}
>> cv = KFold(n=len(X_traina), n_folds=10, shuffle=True)
>> cv_stratified = StratifiedKFold(y_traina, n_folds=5)
>> gs = GridSearchCV(custom_forest, params,
>>                   cv=cv_stratified, verbose=1, refit=True)
>> gs.fit(X_traina, y_traina)
>>
>> This gives me:
>>
>> ValueError: Found array with 0 sample(s) (shape=(0, 491)) while a
>> minimum of 1 is required.
>>
>> I don't understand this, because when I print the shapes of the samples:
>>
>> print (X_traina.shape, X_testa.shape, y_traina.shape, y_testa.shape)
>>
>> I'm getting:
>>
>> ((78, 491), (1489, 491), (78,), (1489,))
>>
>> Interestingly, if I change the test_size parameter to 0.88 (as in
>> the corrected example notebook) it works, and this is the highest value
>> for which it works. For this value, the shapes are:
>>
>> ((188, 491), (1379, 491), (188,), (1379,))
>>
>> So the question is: what should I change in my code to make it work
>> for test_size set to 0.95 as well?
>>
>> Kind regards,
>>
>> Michal Nowotka
>> _______________________________________________
>> scikit-learn mailing list
>> [email protected]
>> https://mail.python.org/mailman/listinfo/scikit-learn
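
Regarding the class-count question in the thread: the labels in the StratifiedKFold printout look like continuous values rather than class labels, and as far as I understand StratifiedKFold treats each distinct label value as its own class. A quick count along the lines of the sketch below might show whether any class has fewer samples than n_folds. This is only an illustration under that assumption: `undersized_classes` is a hypothetical helper name (not part of scikit-learn), and the list holds just the first twelve values from the printout; in the notebook `y_traina` would be passed instead.

```python
from collections import Counter


def undersized_classes(labels, n_folds):
    """Return {label: count} for labels that occur fewer than n_folds times."""
    counts = Counter(labels)
    return {label: c for label, c in counts.items() if c < n_folds}


# First twelve labels copied from the StratifiedKFold printout;
# in the notebook, y_traina would be used here instead.
labels = [5.43, 8.74, 8.1, 6.55, 7.66, 6.52, 8.6, 7.1, 6.4, 8.05, 7.89, 6.68]

too_small = undersized_classes(labels, n_folds=5)
print(len(set(labels)), "distinct labels,",
      len(too_small), "with fewer than 5 samples")
```

If most labels turn out to be undersized, stratifying on the raw values is unlikely to give balanced folds. In that case plain KFold, or stratifying on binned values, may be worth trying.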
