hi Brett,

we should move this conversation to GitHub; please open an issue.

In the meantime, could it be an overflow? Do you have weird results
with a smaller number of samples/features?
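
for instance, something like this (a rough sketch, reusing the names
from your snippet: svm, C_val, gamma_val, X_scaled, y) would tell you
whether the probabilities still collapse onto a few values at a smaller
scale:

import numpy as np

# quick check: refit on a random subsample and count how many distinct
# probability values come out of predict_proba
rng = np.random.RandomState(0)
idx = rng.permutation(X_scaled.shape[0])
train, test = idx[:2000], idx[2000:3000]

clf = svm.SVC(C=C_val, kernel='rbf', gamma=gamma_val, probability=True)
probas = clf.fit(X_scaled[train], y[train]).predict_proba(X_scaled[test])
print 'distinct probability values:', len(np.unique(probas[:, 1]))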

Alex


On Tue, Jun 10, 2014 at 4:00 PM, Brett Meyer
<brett.me...@crowdstrike.com> wrote:
> I have many sparse features, so I'm hashing them into index ranges for
> different types of feature subsets: one feature subset falls in the index
> range 1 million to 2 million, the next in the range 2 million to 3 million,
> and so on.  Since there are thousands of features, calling X.toarray() runs
> me into memory problems, which is not surprising given the number of
> features.
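>
> Roughly, the hashing scheme looks like the sketch below (the helper
> names and the 1M block width are illustrative, not my exact code):
>
> from scipy.sparse import csr_matrix
>
> BLOCK = 1000000  # illustrative width of one feature subset's index range
>
> def hashed_index(token, subset_id):
>     # map a raw feature into its subset's block:
>     # subset k gets indices [k * BLOCK, (k + 1) * BLOCK)
>     return subset_id * BLOCK + (hash(token) % BLOCK)
>
> def vectorize(samples, n_subsets):
>     # samples: list of dicts {subset_id: [feature tokens]}
>     rows, cols, data = [], [], []
>     for i, sample in enumerate(samples):
>         for subset_id, tokens in sample.items():
>             for tok in tokens:
>                 rows.append(i)
>                 cols.append(hashed_index(tok, subset_id))
>                 data.append(1.0)
>     return csr_matrix((data, (rows, cols)),
>                       shape=(len(samples), n_subsets * BLOCK))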
>
> There are 20k instances total, 10k positive and 10k negative, and I'm
> using 5-fold cross-validation.  In the cross-validation results, several
> prediction values are shared by 1k-2k samples each, and there are only
> 3600 distinct prediction values across all of the folds.  The resulting
> ROC looks like five big stair steps, with a little fuzziness around the
> inner corners.
>
>
>
> On 6/10/14, 3:19 AM, "Alexandre Gramfort"
> <alexandre.gramf...@telecom-paristech.fr> wrote:
>
>>hi Brett,
>>
>>your code looks good.
>>
>>can you share a full gist showing clearly the problem?
>>
>>my first check would be to compare against X.toarray()
>>to make sure the dense and sparse code paths agree.
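>>
>>something like the following would do it (just a sketch, reusing the
>>names from your snippet, with a subset small enough that toarray()
>>fits in memory and a fixed random_state so the Platt scaling matches):
>>
>>import numpy as np
>>
>>params = dict(C=C_val, kernel='rbf', gamma=gamma_val,
>>              probability=True, random_state=0)
>># fit the same SVC on the sparse matrix and on a dense copy
>>proba_sparse = svm.SVC(**params).fit(X_train, y_train).predict_proba(X_test)
>>proba_dense = svm.SVC(**params).fit(X_train.toarray(),
>>                                    y_train).predict_proba(X_test.toarray())
>>print 'max difference:', np.abs(proba_sparse - proba_dense).max()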
>>
>>Alex
>>
>>
>>
>>On Mon, Jun 9, 2014 at 9:36 PM, Brett Meyer <brett.me...@crowdstrike.com>
>>wrote:
>>> I'm having an issue using the prediction probabilities for sparse SVM,
>>> where many of the predictions come out the same for my test instances.
>>> These probabilities are produced during cross-validation, and when I plot
>>> an ROC curve for the folds, the results look very strange: there are a
>>> handful of clustered points on the graph.  Here is my cross-validation
>>> code; I based it on the samples from the scikit-learn website:
>>>
>>> skf = StratifiedKFold(y, n_folds=numfolds)
>>>
>>> for train_index, test_index in skf:
>>>     # split the training and testing sets for this fold
>>>     X_train, X_test = X_scaled[train_index], X_scaled[test_index]
>>>     y_train, y_test = y[train_index], y[test_index]
>>>
>>>     # train on the subset for this fold
>>>     print 'Training on fold ' + str(fold)
>>>     classifier = svm.SVC(C=C_val, kernel='rbf', gamma=gamma_val,
>>>                          probability=True)
>>>     probas_ = classifier.fit(X_train, y_train).predict_proba(X_test)
>>>
>>>     # compute ROC curve and area under the curve
>>>     fpr, tpr, thresholds = roc_curve(y_test, probas_[:, 1])
>>>     mean_tpr += interp(mean_fpr, fpr, tpr)
>>>     mean_tpr[0] = 0.0
>>>     roc_auc = auc(fpr, tpr)
>>>
>>> I'm just trying to figure out if there's something I'm obviously missing
>>> here, since I used this same training set and the same SVM parameters
>>> with libsvm and got much better results.  When I used libsvm, printed out
>>> the distances from the hyperplane for the CV test instances, and then
>>> plotted the ROC, it came out much more like I expected, with a much
>>> better AUC.  Any pointers would be greatly appreciated!
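>>>
>>> (For comparison, I believe the libsvm decision values I plotted
>>> correspond roughly to scikit-learn's decision_function; a sketch,
>>> not my exact code:)
>>>
>>> # ROC from the signed distances to the hyperplane instead of the
>>> # Platt-scaled probabilities
>>> scores = classifier.fit(X_train, y_train).decision_function(X_test)
>>> fpr, tpr, thresholds = roc_curve(y_test, scores.ravel())
>>> roc_auc = auc(fpr, tpr)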
>>>
>>> Brett Meyer
>>>
>>>
>>>
>>
>
