Hi, do you mean that you get 100% accuracy on the whole training set but only 
~60% when you evaluate the model on subsets of the training set 
(cross-validation)? You wrote:
> accuracy was 100% when tested on the true y.
> 
>  But for every combination of 16 values I randomly assign to y (equally 
> populated 0 and 1) the accuracy is >60% 

This may indicate strong overfitting then ... what about increasing the 
regularization strength? (In LinearSVC that means decreasing C, since C is 
the inverse of the regularization strength.) It's a really tricky situation 
with your 16 samples and 112 features, and there may be a lot of noise. I 
think regularization should help, though (if there is useful, discriminative 
information buried in the data), especially L1 regularization, which drives 
many of the weights to exactly zero. LinearSVC has a penalty parameter where 
you can toggle between 'l1' and 'l2'.
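
Something like this could be a starting point (just a sketch, assuming the 
X_scaled and y arrays from your code below; the C grid is an arbitrary 
example):

import numpy as np
from sklearn import svm
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import StratifiedShuffleSplit

# tune the inverse regularization strength C of an L1-penalized LinearSVC;
# smaller C = stronger regularization (more weights pushed to exactly zero)
cv = StratifiedShuffleSplit(y, 100, test_size=0.25, random_state=0)
param_grid = {'C': [0.001, 0.01, 0.1, 1.0, 10.0]}  # arbitrary example grid
grid = GridSearchCV(svm.LinearSVC(penalty='l1', dual=False, random_state=1),
                    param_grid, cv=cv)
grid.fit(X_scaled, y)
print "best C:", grid.best_params_['C']
print "best CV accuracy:", grid.best_score_

Keep in mind that with only 16 samples the model selection itself can 
overfit, so I wouldn't read too much into the best score either.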

But maybe also try some tree-based method to look at the feature importances; 
I would just try some feature selection approaches to see if this could 
additionally help with generalization. And while you are at it, maybe also 
try dimensionality reduction; see the sketch below.
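
For example (again just a sketch; ExtraTreesClassifier, SelectKBest, and PCA 
are only one example of each approach, and n_estimators / k / n_components 
are arbitrary choices):

import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

# rank features by tree-based importance scores
forest = ExtraTreesClassifier(n_estimators=250, random_state=0)
forest.fit(X_scaled, y)
ranking = np.argsort(forest.feature_importances_)[::-1]
print "top 10 features by importance:", ranking[:10]

# univariate feature selection: keep the k highest-scoring features
X_selected = SelectKBest(f_classif, k=10).fit_transform(X_scaled, y)

# dimensionality reduction: project onto the first few principal components
X_reduced = PCA(n_components=5).fit_transform(X_scaled)

Note that to get an unbiased accuracy estimate you'd have to do the selection 
or reduction inside the cross-validation loop (e.g., via a Pipeline), not 
once on the full data set.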

Best,
Sebastian


> On Apr 24, 2015, at 12:53 PM, Fabrizio Fasano <han...@gmail.com> wrote:
> 
> Dear community,
> 
> I'm performing a binary classification on a very small data set:
> 
> details:
> - binary classification (y = 0, 1)
> - small data set (16 samples)
> - large feature set (112 features)
> - balanced labels (y = 0 and y = 1 occur 8 times each)
> - linear SVM classifier
> 
> accuracy was 100% when tested on the true y. But for every combination of 16 
> values I randomly assign to y (equally populated 0 and 1) the accuracy is 
> >60% (tested by cross validation 25% test 75% train with many CV, not only 
> the StratifiedShuffleSplit one in the code below).
> 
> I understand that a "small sample, large feature set" setting is a bad thing, 
> but how can the procedure return consistently good results?
> 
> thanks a lot for your help
> 
> Fabrizio
> 
> 
> CODE:
> 
> print "\nWhen a stratified shuffle split is apllied"
> from sklearn.cross_validation import StratifiedShuffleSplit
> sss = StratifiedShuffleSplit(y, 100, test_size=0.25, random_state=0)
> #len(sss)
> print "shuffled permutations:"
> print(sss)
> clf = svm.LinearSVC(penalty="l1", dual=False, C=1, random_state=1)
> cv_scores=[]
> 
> for train_index, test_index in sss:
>    print("TRAIN:", train_index, "TEST:", test_index)
>    X_train, X_test = X_scaled[train_index], X_scaled[test_index]
>    y_train, y_test = y[train_index], y[test_index]
>    clf.fit(X_train, y_train)
>    y_pred = clf.predict(X_test)
>    print "true label:", y_test
>    print "predicted label", y_pred
>    cv_scores.append(np.sum(y_pred == y_test) / float(np.size(y_test)))
> 
> print "Accuracy ", np.ceil(100*np.mean(cv_scores)), "+/-", 
> np.ceil(200*np.std(cv_scores))
> 

_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general