Hi Alex, I'll open an issue on GitHub in a bit. I got what seemed to be very reasonable prediction results on a training set with fewer sparse features, so it seems like there has to be some issue with using the sparse vectors.
Brett

On 6/10/14, 10:55 AM, "Alexandre Gramfort" <alexandre.gramf...@telecom-paristech.fr> wrote:

>hi Brett,
>
>we should move this conversation to github
>
>please open an issue.
>
>In the mean time, could it be an overflow? Do you have weird results
>with a smaller number of samples/features?
>
>Alex
>
>
>On Tue, Jun 10, 2014 at 4:00 PM, Brett Meyer
><brett.me...@crowdstrike.com> wrote:
>> I have many sparse features, so I'm hashing those into index ranges for
>> different types of feature subsets, so one feature subset will be in the
>> index range 1 million to 2 million, the next will be in the range 2
>> million to 3 million, etc. Since there are thousands of features, using
>> X.toarray() has run me into problems due to memory issues, which is not
>> surprising given the number of features.
>>
>> There are 20k instances total, 10k positive and 10k negative, and I'm
>> using 5-fold cross-validation. In the cross-validation results, there are
>> several prediction values for which there are 1k-2k samples that all have
>> the same prediction value, and there are only 3600 distinct prediction
>> values over all of the folds for cross-validation. The resulting ROC
>> looks like five big stair steps, with some little bits of fuzziness around
>> the inner corners.
>>
>>
>> On 6/10/14, 3:19 AM, "Alexandre Gramfort"
>> <alexandre.gramf...@telecom-paristech.fr> wrote:
>>
>>>hi Brett,
>>>
>>>your code looks good.
>>>
>>>can you share a full gist showing clearly the problem?
>>>
>>>my first check would be to compare using X.toarray()
>>>to make sure the dense / sparse code agree.
>>>
>>>Alex
>>>
>>>
>>>On Mon, Jun 9, 2014 at 9:36 PM, Brett Meyer
>>><brett.me...@crowdstrike.com> wrote:
>>>> I'm having an issue using the prediction probabilities for sparse SVM,
>>>> where many of the predictions come out the same for my test instances.
>>>> These probabilities are produced during cross validation, and when I
>>>> plot an ROC curve for the folds, the results look very strange, as
>>>> there are a handful of clustered points on the graph. Here is my cross
>>>> validation code, I based it off of the samples on the scikit website:
>>>>
>>>> skf = StratifiedKFold(y, n_folds=numfolds)
>>>>
>>>> for train_index, test_index in skf:
>>>>     #split the training and testing sets
>>>>     X_train, X_test = X_scaled[train_index], X_scaled[test_index]
>>>>     y_train, y_test = y[train_index], y[test_index]
>>>>
>>>>     #train on the subset for this fold
>>>>     print 'Training on fold ' + str(fold)
>>>>     classifier = svm.SVC(C=C_val, kernel='rbf', gamma=gamma_val,
>>>>                          probability=True)
>>>>     probas_ = classifier.fit(X_train, y_train).predict_proba(X_test)
>>>>
>>>>     #Compute ROC curve and area the curve
>>>>     fpr, tpr, thresholds = roc_curve(y_test, probas_[:, 1])
>>>>     mean_tpr += interp(mean_fpr, fpr, tpr)
>>>>     mean_tpr[0] = 0.0
>>>>     roc_auc = auc(fpr, tpr)
>>>>
>>>> I'm just trying to figure out if there's something I'm obviously
>>>> missing here, since I used this same training set and SVM parameters
>>>> with libsvm and got much better results. When I used libsvm and
>>>> printed out the distances from the hyperplane for the CV test
>>>> instances and then plotted the ROC, it came out much more like I
>>>> expected, and a much better AUC. Any pointers would be greatly
>>>> appreciated!
>>>>
>>>> Brett Meyer
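The dense / sparse comparison suggested in the thread can be tried without densifying the whole matrix. The following is only a minimal sketch, assuming a recent scikit-learn and SciPy (Python 3) rather than the 2014 API quoted above; the function name `compare_sparse_dense` and its parameters are hypothetical and are not code from this thread. It fits the same RBF `SVC` on a small random subset of rows, once as a sparse matrix and once as a dense array, and reports the largest difference in decision values. All-zero columns are dropped first, which leaves RBF distances unchanged and keeps `.toarray()` small.

```python
import numpy as np
from scipy import sparse
from sklearn.svm import SVC


def compare_sparse_dense(X_sparse, y, n_rows=500, C=1.0, gamma=0.1, seed=0):
    """Fit one RBF SVC on sparse input and one on the dense copy of the
    same subset, and return the max absolute difference between their
    decision values. Hypothetical helper, for illustration only."""
    rng = np.random.RandomState(seed)
    n_rows = min(n_rows, X_sparse.shape[0])
    rows = rng.choice(X_sparse.shape[0], size=n_rows, replace=False)

    X_sub = X_sparse.tocsr()[rows]
    # Keep only columns that have at least one nonzero entry in the subset,
    # so the dense copy fits in memory even with millions of hashed indices.
    used_cols = np.unique(X_sub.nonzero()[1])
    X_sub = X_sub[:, used_cols]
    y_sub = np.asarray(y)[rows]

    sparse_clf = SVC(C=C, kernel="rbf", gamma=gamma).fit(X_sub, y_sub)
    dense_clf = SVC(C=C, kernel="rbf", gamma=gamma).fit(X_sub.toarray(), y_sub)

    dec_sparse = sparse_clf.decision_function(X_sub)
    dec_dense = dense_clf.decision_function(X_sub.toarray())
    return float(np.max(np.abs(dec_sparse - dec_dense)))


if __name__ == "__main__":
    # Synthetic stand-in data; the real X_scaled and y are not available here.
    X_demo = sparse.random(200, 5000, density=0.01, format="csr", random_state=0)
    y_demo = np.array([0, 1] * 100)
    print(compare_sparse_dense(X_demo, y_demo, n_rows=100))
```

A difference near zero would suggest the sparse and dense code paths agree on that subset, in which case the plateaus in `predict_proba` may come from somewhere else. Since `roc_curve` accepts any continuous score, passing `decision_function` output instead of `probas_[:, 1]` would be closer to the libsvm "distance from the hyperplane" experiment described in the thread.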
------------------------------------------------------------------------------ HPCC Systems Open Source Big Data Platform from LexisNexis Risk Solutions Find What Matters Most in Your Big Data with HPCC Systems Open Source. Fast. Scalable. Simple. Ideal for Dirty Data. Leverages Graph Analysis for Fast Processing & Easy Data Exploration http://p.sf.net/sfu/hpccsystems
_______________________________________________ Scikit-learn-general mailing list Scikit-learn-general@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/scikit-learn-general