I have many sparse features, so I'm hashing them into index ranges, one range per type of feature subset: one feature subset will be in the index range 1 million to 2 million, the next will be in the range 2 million to 3 million, etc. Since there are thousands of features, calling X.toarray() runs into memory problems, which is not surprising given the number of features.
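In case it's useful, the hashing step looks roughly like the sketch below. This is a simplified stand-in rather than my actual code: FeatureHasher, the 1M-wide SUBSET_WIDTH, and the subset/token names are just placeholders for the scheme I described, with one hasher per feature subset, hstack'd so each subset lands in its own contiguous column range and the matrix stays sparse.

    import scipy.sparse as sp
    from sklearn.feature_extraction import FeatureHasher

    SUBSET_WIDTH = 1000000  # each feature subset gets its own 1M-wide index range

    def hash_feature_subsets(subset_token_lists):
        # subset_token_lists: one entry per feature subset; each entry is a
        # list (one per sample) of the token strings belonging to that subset
        blocks = []
        for tokens_per_sample in subset_token_lists:
            hasher = FeatureHasher(n_features=SUBSET_WIDTH, input_type='string')
            blocks.append(hasher.transform(tokens_per_sample))
        # hstack shifts subset k into columns [k * 1M, (k + 1) * 1M), so the
        # whole matrix stays sparse and X.toarray() is never needed
        return sp.hstack(blocks).tocsr()

So each subset occupies its own 1M-wide block of column indices and the result stays a CSR matrix end to end.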
There are 20k instances total, 10k positive and 10k negative, and I'm using 5-fold cross-validation. In the cross-validation results there are several prediction values that 1k-2k samples each share exactly, and there are only 3600 distinct prediction values across all of the folds. The resulting ROC looks like five big stair steps, with some little bits of fuzziness around the inner corners.

On 6/10/14, 3:19 AM, "Alexandre Gramfort" <alexandre.gramf...@telecom-paristech.fr> wrote:

>hi Brett,
>
>your code looks good.
>
>can you share a full gist showing clearly the problem?
>
>my first check would be to compare using X.toarray()
>to make sure the dense / sparse code agree.
>
>Alex
>
>On Mon, Jun 9, 2014 at 9:36 PM, Brett Meyer <brett.me...@crowdstrike.com> wrote:
>> I'm having an issue using the prediction probabilities for sparse SVM, where
>> many of the predictions come out the same for my test instances. These
>> probabilities are produced during cross-validation, and when I plot an ROC
>> curve for the folds, the results look very strange, as there are a handful
>> of clustered points on the graph. Here is my cross-validation code, which I
>> based on the examples on the scikit-learn website:
>>
>> import numpy as np
>> from scipy import interp
>> from sklearn import svm
>> from sklearn.cross_validation import StratifiedKFold
>> from sklearn.metrics import roc_curve, auc
>>
>> skf = StratifiedKFold(y, n_folds=numfolds)
>> mean_tpr = 0.0
>> mean_fpr = np.linspace(0, 1, 100)
>>
>> for fold, (train_index, test_index) in enumerate(skf):
>>     # split the training and testing sets
>>     X_train, X_test = X_scaled[train_index], X_scaled[test_index]
>>     y_train, y_test = y[train_index], y[test_index]
>>
>>     # train on the subset for this fold
>>     print 'Training on fold ' + str(fold)
>>     classifier = svm.SVC(C=C_val, kernel='rbf', gamma=gamma_val,
>>                          probability=True)
>>     probas_ = classifier.fit(X_train, y_train).predict_proba(X_test)
>>
>>     # compute ROC curve and area under the curve for this fold
>>     fpr, tpr, thresholds = roc_curve(y_test, probas_[:, 1])
>>     mean_tpr += interp(mean_fpr, fpr, tpr)
>>     mean_tpr[0] = 0.0
>>     roc_auc = auc(fpr, tpr)
>>
>> I'm just trying to figure out if there's something I'm obviously missing
>> here, since I used this same training set and SVM parameters with libsvm and
>> got much better results. When I used libsvm and printed out the distances
>> from the hyperplane for the CV test instances and then plotted the ROC, it
>> came out much more like I expected, with a much better AUC. Any pointers
>> would be greatly appreciated!
>>
>> Brett Meyer
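Based on the libsvm comparison in my original message quoted above, the next thing I plan to check is an ROC built from the raw SVM decision values instead of the Platt-scaled probabilities from predict_proba. Here is a rough diagnostic sketch, not something I've run yet, reusing the same skf, X_scaled, y, C_val, and gamma_val as in the code above:

    from sklearn.metrics import roc_curve, auc

    for fold, (train_index, test_index) in enumerate(skf):
        X_train, X_test = X_scaled[train_index], X_scaled[test_index]
        y_train, y_test = y[train_index], y[test_index]

        classifier = svm.SVC(C=C_val, kernel='rbf', gamma=gamma_val,
                             probability=True)
        classifier.fit(X_train, y_train)

        # scores from the Platt-scaled probabilities (what I'm using now)
        proba_scores = classifier.predict_proba(X_test)[:, 1]
        # signed distance to the hyperplane, analogous to the libsvm
        # decision values I compared against
        dist_scores = classifier.decision_function(X_test).ravel()

        fpr_p, tpr_p, _ = roc_curve(y_test, proba_scores)
        fpr_d, tpr_d, _ = roc_curve(y_test, dist_scores)
        print 'fold %d: proba AUC %.3f, decision_function AUC %.3f' % (
            fold, auc(fpr_p, tpr_p), auc(fpr_d, tpr_d))

If the decision_function ROC looks like the libsvm one while the predict_proba ROC still shows the stair steps, that would at least narrow the problem down to the probability calibration step rather than the sparse handling.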