Hi, do you mean that you get 100% accuracy on the whole training set but only
~60% when you evaluate the model on held-out subsets of the training set
(cross-validation)? You said:
> accuracy was 100% when tested on the true y.
>
> But for every combination of 16 values I randomly assign to y (equally
> populated 0 and 1) the accuracy is >60%
That would indicate strong overfitting in this case then ... what about
increasing the regularization strength? With 16 samples and 112 features it's a
really tricky situation, and there may be a lot of noise. I think
regularization should help, though (if there is useful, discriminative
information buried in the data), especially L1 regularization. LinearSVC has a
penalty parameter where you can toggle between L1 and L2; note that its C
parameter is the *inverse* regularization strength, so stronger regularization
means a smaller C.
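Here is a minimal sketch of what I mean, on synthetic stand-in data with your
shapes (16 x 112), since I obviously don't have your X and y: sweep C with the
L1 penalty and count how many coefficients survive. Smaller C = stronger
regularization = sparser model.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Synthetic stand-in: 16 samples, 112 features, balanced labels
rng = np.random.RandomState(0)
X = rng.randn(16, 112)
y = np.array([0, 1] * 8)

# Smaller C means stronger regularization; L1 drives coefficients to exactly zero
kept = []
for C in [0.01, 0.1, 1.0, 10.0]:
    clf = LinearSVC(penalty="l1", dual=False, C=C, random_state=1)
    clf.fit(X, y)
    n_nonzero = int(np.sum(np.abs(clf.coef_) > 1e-6))
    kept.append(n_nonzero)
    print("C=%g -> %d of 112 features kept" % (C, n_nonzero))
```

On your real data you would of course pick C by nested cross-validation, not
by eyeballing the sparsity.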
But maybe also try a tree-based method to look at feature importances; I would
try some feature-selection approaches to see whether they additionally help
generalization. And while you are at it, maybe also try dimensionality
reduction.
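A quick sketch of both ideas, again on synthetic stand-in data (the shapes and
the top-10 / 5-component choices are arbitrary placeholders): rank features by
a random forest's feature_importances_ and keep the top ones, or compress the
feature matrix with PCA.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.randn(16, 112)          # stand-in for your 16 x 112 matrix
y = np.array([0, 1] * 8)

# Rank features by random-forest importance and keep the 10 strongest
forest = RandomForestClassifier(n_estimators=200, random_state=1)
forest.fit(X, y)
top10 = np.argsort(forest.feature_importances_)[::-1][:10]
X_selected = X[:, top10]

# Or compress with PCA instead (at most n_samples components are possible here)
X_reduced = PCA(n_components=5).fit_transform(X)
print(X_selected.shape, X_reduced.shape)
```

One caveat: do the selection (or PCA fit) inside each cross-validation fold,
on the training split only; selecting features on all 16 samples first would
bias your held-out accuracy upward.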
Best,
Sebastian
> On Apr 24, 2015, at 12:53 PM, Fabrizio Fasano <han...@gmail.com> wrote:
>
> Dear community,
>
> I'm performing a binary classification on a very small data set:
>
> details:
> -binary classification (Y=0,1)
> -small dataset (16 samples)
> -large features set (112 features)
> -balanced labels (y=0 and y=1 occur 8 times each)
> -linear SVM classifier.
>
> accuracy was 100% when tested on the true y. But for every combination of 16
> values I randomly assign to y (equally populated 0 and 1) the accuracy is
> >60% (tested by cross validation 25% test 75% train with many CV, not only
> the StratifiedShuffleSplit one in the code below).
>
> I understand that a small sample with a large feature set is a bad thing, but
> how can the procedure return a good result every time?
>
> thanks a lot for your help
>
> Fabrizio
>
>
> CODE:
>
> print "\nWhen a stratified shuffle split is applied"
> import numpy as np
> from sklearn import svm
> from sklearn.cross_validation import StratifiedShuffleSplit
>
> # X_scaled and y are defined earlier (16 samples, 112 scaled features)
> sss = StratifiedShuffleSplit(y, 100, test_size=0.25, random_state=0)
> print "shuffled permutations:"
> print(sss)
> clf = svm.LinearSVC(penalty="l1", dual=False, C=1, random_state=1)
> cv_scores = []
>
> for train_index, test_index in sss:
>     print("TRAIN:", train_index, "TEST:", test_index)
>     X_train, X_test = X_scaled[train_index], X_scaled[test_index]
>     y_train, y_test = y[train_index], y[test_index]
>     clf.fit(X_train, y_train)
>     y_pred = clf.predict(X_test)
>     print "true label:", y_test
>     print "predicted label:", y_pred
>     cv_scores.append(np.mean(y_pred == y_test))
>
> print "Accuracy ", np.ceil(100 * np.mean(cv_scores)), "+/-", np.ceil(200 * np.std(cv_scores))
>
> ------------------------------------------------------------------------------
> One dashboard for servers and applications across Physical-Virtual-Cloud
> Widest out-of-the-box monitoring support with 50+ applications
> Performance metrics, stats and reports that give you Actionable Insights
> Deep dive visibility with transaction tracing using APM Insight.
> http://ad.doubleclick.net/ddm/clk/290420510;117567292;y
> _______________________________________________
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general