Dear John,

Hello,

I am having difficulty with a cross validation problem, and any help would be much appreciated.

I have a large number of research subjects from 15 different data collection sites. I want to assess whether "site" has any influence on the data.

The simplest way to do this is to start with classical statistics: why not simply test the impact of site on the data using a classical analysis of variance?
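As a rough illustration of the analysis of variance idea, here is a minimal sketch that computes the one-way ANOVA F statistic for a single measurement grouped by site. The function name `one_way_anova_f` and the toy numbers are hypothetical and not taken from the thread. In practice `scipy.stats.f_oneway` returns the same statistic together with a p-value, and with 3 datapoints per subject the test would presumably be run once per measurement.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of numbers.

    Each inner list holds one measurement for all subjects at one site.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Variation of the site means around the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Variation of subjects around their own site mean
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, means) for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Toy data (hypothetical): three sites, three subjects each
toy_groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat, df_b, df_w = one_way_anova_f(toy_groups)
print(f_stat, df_b, df_w)
```

A large F relative to the F distribution with (df_between, df_within) degrees of freedom would suggest that the site means differ.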

It occurred to me that one way to do this would be to perform a cross-validation, via stratified k folds (stratified, because some sites have a larger number of subjects than others). Unless I am mistaken, the results of this analysis should reveal whether "site" has an influence on the data. However, I am running into a problem because my training set is a different shape than the test data, which causes the analysis to fail.

My data structure is pretty simple.

X is a 3 by 1000 matrix of datapoints (that is, 3 datapoints per subject)
y is a 1 by 1000 matrix indicating the site (expressed as an integer ranging between 1 and 15).


Here is the code that I use, and below it is the error that is produced.


from sklearn import cross_validation
skf = cross_validation.StratifiedKFold(y, 15)

for train_index, test_index in skf:
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    clf = svm.SVC(kernel='rbf', C=1.0)
    clf.fit(X_train, X_test)

clf.fit should take (X_train, y_train), which means that you are learning a model to predict y from X. Is this really what you want? If so, clf.score(X_test, y_test) would then quantify the performance of the learned model on the test data.
HTH,

Bertrand
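For reference, the suggestion above can be sketched as follows. This is only a sketch under stated assumptions: the data are synthetic stand-ins, X is laid out with one row per subject (n_samples by 3), and it uses the `sklearn.model_selection` API from newer scikit-learn releases rather than the `cross_validation` module from the 0.13.1 release that appears in the traceback.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption, not the real study data):
# 15 sites, 20 subjects per site, 3 datapoints per subject.
rng = np.random.RandomState(0)
n_sites, per_site = 15, 20
y = np.repeat(np.arange(1, n_sites + 1), per_site)  # site label per subject
X = rng.normal(size=(y.size, 3))                    # rows are subjects

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    clf = SVC(kernel='rbf', C=1.0)
    clf.fit(X_train, y_train)                 # learn to predict site from data
    scores.append(clf.score(X_test, y_test))  # accuracy on held-out subjects

mean_accuracy = float(np.mean(scores))
chance_level = 1.0 / n_sites
print("mean accuracy:", mean_accuracy, "chance level:", chance_level)
```

If the site has no influence on the data, the mean accuracy should stay close to the chance level (1/15 for equally sized sites). Accuracy clearly above chance would suggest that the site can be predicted from the data.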




Traceback (most recent call last):
  File "cross_val.py", line 132, in <module>
    clf.fit(X_train, X_test)
  File "/Library/Frameworks/EPD64.framework/Versions/7.1/lib/python2.7/site-packages/scikit_learn-0.13.1-py2.7-macosx-10.5-x86_64.egg/sklearn/svm/base.py", line 166, in fit
    (X.shape[0], y.shape[0]))
ValueError: X and y have incompatible shapes.
X has 966 samples, but y has 210.


Thanks for any help you can offer!






_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
