Re: [Scikit-learn-general] data preprocessing

Andreas Mueller Thu, 01 Nov 2012 10:11:09 -0700

On 11/01/2012 03:43 PM, paul.czodrow...@merckgroup.com wrote:

Dear RDKitters,



> > However, I found it strange that "X_train.shape" gives (373, 177) -
> > shouldn't the second bit be the number of classes, i.e. 2?
>
> [snip]
>
> > 177 corresponds, BTW, to the number of features..
>
> And that's exactly what this is supposed to represent. The number of
> classes is len(np.unique(y)).
>
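
A minimal sketch of the point above, assuming NumPy arrays with the shapes quoted in the thread. The random data is a hypothetical stand-in, not the poster's data set.

```python
import numpy as np

# Hypothetical stand-in data with the shape reported in the thread.
rng = np.random.RandomState(0)
X_train = rng.rand(373, 177)
y_train = rng.randint(0, 2, size=373)

# X has one row per sample and one column per feature.
n_samples, n_features = X_train.shape

# The number of classes comes from the labels, not from X.
n_classes = len(np.unique(y_train))

print(n_samples, n_features, n_classes)
```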

Got it, thanks!

X_train looks like
"[[ 313.371  294.219    0.    ...,    0. 0.       0.   ]
[ 234.343  212.167    0.    ...,    0.       0.       0.   ] ...
"

y_train
"
[0 0 ..
"

X_test & y_test look similar and therefore OK in my naive eyes. However, SVC gives horrible results:

"
from sklearn import metrics
from sklearn.svm import SVC
svm = SVC()
svm.fit(X_train, y_train)
y_predict = svm.predict(X_test)
print(metrics.confusion_matrix(y_test, y_predict))
"
=>
[[182   0]
[ 67   0]]

1) First, start with a linear classifier.
2) For SVC you have to cross-validate gamma and C.
3) Your classes are unbalanced. Consider reweighting or subsampling.
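
The confusion matrix above has an empty second column, so every test sample was predicted as class 0. The three suggestions could be sketched roughly as follows. This is only an illustration under stated assumptions: it uses a recent scikit-learn (sklearn.model_selection, class_weight="balanced"), synthetic data from make_classification in place of the poster's data, an arbitrary parameter grid, and a StandardScaler step that is not mentioned in the reply but is commonly used before an RBF SVC.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in data (hypothetical, not the poster's data).
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           weights=[0.75, 0.25], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# 1) Start with a linear classifier as a baseline.
#    Scaling is an added assumption, not part of the original advice.
linear = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000))
linear.fit(X_train, y_train)
linear_cm = confusion_matrix(y_test, linear.predict(X_test))

# 2) Cross-validate C and gamma for the RBF SVC.
# 3) Reweight the classes to counter the imbalance.
pipe = make_pipeline(StandardScaler(),
                     SVC(kernel="rbf", class_weight="balanced"))
param_grid = {
    "svc__C": [0.1, 1, 10, 100],        # illustrative values only
    "svc__gamma": [1e-3, 1e-2, 1e-1],   # illustrative values only
}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)
svc_cm = confusion_matrix(y_test, search.predict(X_test))

print(linear_cm)
print(search.best_params_)
print(svc_cm)
```

GridSearchCV scores by accuracy by default, which can look good on unbalanced data even when the minority class is ignored, so comparing the confusion matrices is more informative than comparing a single score.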

kNN stops with an error message:
"
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(X_train,y_train)
y_predict = knn.predict(X_test)
"
=>
"
NeighborsWarning: kneighbors: neighbor k+1 and neighbor k have the same distance: results will be dependent on data order.
 neigh_dist, neigh_ind = self.kneighbors(X)
"

As you can see from the message, this is a warning, not an error.

It probably means that either your data is discretized or you have duplicate input points.
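
Both possibilities can be checked directly with NumPy. The snippet below is a small sketch on hypothetical toy data and assumes a NumPy version that supports the axis argument of np.unique.

```python
import numpy as np

# Hypothetical toy data: the first two rows are identical.
X = np.array([[1.0, 2.0],
              [1.0, 2.0],
              [3.0, 4.0],
              [0.0, 5.0]])

# Duplicate input points: compare the row count with the unique-row count.
unique_rows = np.unique(X, axis=0)
n_duplicates = X.shape[0] - unique_rows.shape[0]

# Discretized data: count distinct values per feature.
# Very small counts relative to the number of samples hint at discretization.
n_distinct = np.array([np.unique(X[:, j]).size for j in range(X.shape[1])])

print(n_duplicates, n_distinct)
```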

