Re: [Scikit-learn-general] getting different results with sklearn gridsearchCV

Pagliari, Roberto Fri, 12 Sep 2014 09:32:01 -0700

Hi Andy,
I don't think the accuracy is an issue. I explicitly provided a score function 
and the problem persists.
With my own grid search I don't use a pipeline, just StratifiedKFold, and I 
average the score for every combination of the parameters.


This is an example with scaling+svm using sklearn pipeline:

    from sklearn import grid_search, svm
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    estimators = [('scaler', StandardScaler()),
                  ('linear_svm', svm.LinearSVC(class_weight='auto'))]

    clf_pipeline = Pipeline(estimators)
    params = dict(linear_svm__C=<some array of values>)
    clf = grid_search.GridSearchCV(clf_pipeline, param_grid=params)
    # Here I'm not scaling, since I assume gridsearch will do it while searching
    clf.fit(X_train, y_train)

After this I make the predictions
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    y_predictions = clf.predict(X_test)

With binning, I would just add the Binarizer to the pipeline, and apply it 
again right before computing y_predictions.
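
A minimal sketch of what a pipeline wrapped in GridSearchCV does at prediction 
time. It assumes a recent scikit-learn release, so it imports from 
sklearn.model_selection rather than the grid_search module used above and 
passes class_weight="balanced" where the thread uses 'auto'. The synthetic 
data and the C values are made up for illustration. With the default 
refit=True, the best pipeline is refitted on the whole training set, so 
predict() already applies the scaler; scaling X_test beforehand means it gets 
scaled twice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic data (an assumption), shifted and stretched so scaling matters.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X = X * 50.0 + 200.0
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

pipe = Pipeline([("scaler", StandardScaler()),
                 ("linear_svm", LinearSVC(class_weight="balanced",
                                          random_state=0))])
clf = GridSearchCV(pipe, param_grid={"linear_svm__C": [0.01, 0.1, 1.0]}, cv=3)
clf.fit(X_train, y_train)  # raw data in; the pipeline scales internally

# The refitted pipeline holds a scaler fitted on all of X_train.
fitted_scaler = clf.best_estimator_.named_steps["scaler"]
manual_scaler = StandardScaler().fit(X_train)
same_stats = np.allclose(fitted_scaler.mean_, manual_scaler.mean_)

# What the final SVM step receives for raw input versus pre-scaled input.
seen_raw = fitted_scaler.transform(X_test)
seen_prescaled = fitted_scaler.transform(manual_scaler.transform(X_test))
scaled_twice = not np.allclose(seen_raw, seen_prescaled)

pred_raw = clf.predict(X_test)                                  # scaled once
pred_prescaled = clf.predict(manual_scaler.transform(X_test))   # scaled twice
```

Under these assumptions, clf.predict(X_test) on the unscaled test set is the 
call that matches what the grid search evaluated.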

Is there anything wrong with what I'm doing?

Thank you


From: Andy [mailto:t3k...@gmail.com]
Sent: Friday, September 12, 2014 12:12 PM
To: scikit-learn-general@lists.sourceforge.net
Subject: Re: [Scikit-learn-general] getting different results with sklearn 
gridsearchCV

Hi Roberto.
GridSearchCV uses accuracy for selection if no other method is specified, so 
there should be no difference.
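
One way to check this, sketched under the assumption of a recent scikit-learn 
release (sklearn.model_selection import path) and made-up data and C values: 
run the same search twice on identical folds, once with the default scoring 
and once with scoring="accuracy", and compare the mean test scores.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import LinearSVC

# Synthetic data and grid, chosen only for illustration.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
grid = {"C": [0.01, 0.1, 1.0]}

# A fixed random_state makes both searches use exactly the same folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

default_search = GridSearchCV(LinearSVC(random_state=0), grid, cv=cv)
default_search.fit(X, y)

accuracy_search = GridSearchCV(LinearSVC(random_state=0), grid, cv=cv,
                               scoring="accuracy")
accuracy_search.fit(X, y)

same_scores = np.allclose(default_search.cv_results_["mean_test_score"],
                          accuracy_search.cv_results_["mean_test_score"])
print(same_scores)
```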

Could you provide code?
Do you also create a pipeline when using your own grid search? I would imagine 
there is some difference in how you do the fitting in the pipeline.

Cheers,
Andy


On 09/12/2014 05:09 PM, Pagliari, Roberto wrote:
Regarding my previous question, I suspect the difference lies in the scoring 
function.

What is the default scoring function used by gridsearch?

In my own implementation  I am using
number of correctly classified samples (no weighting) / total number of samples

sklearn gridsearch function must be using something else, or maybe the same, 
but with weighting?
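
For reference, a small sketch of the score described above next to 
sklearn.metrics.accuracy_score. The labels are invented for illustration. 
accuracy_score only weights samples when a sample_weight argument is passed.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels: 6 of the 8 predictions are correct.
y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 0, 1, 0, 0, 1, 0])

# Number of correctly classified samples / total number of samples.
manual = np.sum(y_true == y_pred) / len(y_true)

# Unweighted accuracy as computed by scikit-learn.
library = accuracy_score(y_true, y_pred)

print(manual, library)
```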

Thanks,


From: Pagliari, Roberto
Sent: Friday, September 12, 2014 10:21 AM
To: scikit-learn-general@lists.sourceforge.net
Subject: getting different results with sklearn gridsearchCV

I am comparing the results of sklearn cross-validation and my own cross 
validation.

I tested LinearSVC under the following conditions:

- Data scaling, per grid search
- Data scaling + 2-level quantization, per grid search

Specifically, I have done the following:

sklearn GridSearchCV

- Create a pipeline with [StandardScaler, LinearSVC] if no binning is used,
  or [StandardScaler, Binarizer, LinearSVC] if binning is used
- Invoke sklearn grid search (only C is provided as a parameter to optimize
  over)
- When done with the grid search:
    o Scale the entire training set
    o Scale the test set (with the mean/std found on the training set)
    o Quantize, if quantization is used
    o Run LinearSVC with the best C value found

My own grid search

- Search over all possible values of C (same range as above)
- For each value of C, use StratifiedKFold with random_seed set to a random
  number:
    o Scale the train cross-validation dataset, and the test cross-validation
      dataset with the train CV mean and std
    o If binning is used, apply binary binning (my own function) on top of
      StandardScaler
    o For each value of C, compute the average score over all partitions,
      where the score is defined as number of correctly classified samples /
      total number of samples
- When done with the grid search:
    o Scale the entire training set
    o Scale the test set (with the mean/std found on the training set)
    o Quantize, if quantization is used
    o Run LinearSVC with the best C value found
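
The hand-rolled search above (without binning) can be sketched roughly as 
follows and compared with GridSearchCV over a pipeline. This assumes a recent 
scikit-learn release; the data, the C values, and the helper manual_cv_score 
are hypothetical. Both searches share one StratifiedKFold object with a fixed 
random_state, so they see identical folds. With a different random seed per 
run, the folds and therefore the scores would not be expected to match 
exactly.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic data and C range, chosen only for illustration.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
C_values = [0.01, 0.1, 1.0]
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)


def manual_cv_score(C):
    """Mean per-fold accuracy, scaling each fold with its own train stats."""
    fold_scores = []
    for train_idx, val_idx in cv.split(X, y):
        # Fit the scaler on the training part of the fold only.
        scaler = StandardScaler().fit(X[train_idx])
        model = LinearSVC(C=C, random_state=0)
        model.fit(scaler.transform(X[train_idx]), y[train_idx])
        pred = model.predict(scaler.transform(X[val_idx]))
        # Correctly classified samples / total samples in the fold.
        fold_scores.append(np.mean(pred == y[val_idx]))
    return float(np.mean(fold_scores))


manual_scores = [manual_cv_score(C) for C in C_values]

# The same search expressed as a pipeline inside GridSearchCV.
pipe = Pipeline([("scaler", StandardScaler()),
                 ("linear_svm", LinearSVC(random_state=0))])
search = GridSearchCV(pipe, {"linear_svm__C": C_values}, cv=cv).fit(X, y)
library_scores = search.cv_results_["mean_test_score"]

print(np.allclose(manual_scores, library_scores))
```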

For some reason, I'm getting different results. In particular, sklearn 
gridsearch performs better than my own gridsearch when not using quantization, 
and it gets worse with quantization. With my own gridsearch I'm getting the 
opposite trend.

Is my understanding of sklearn gridsearch wrong, or are there any issues with 
it?

Thank you,






_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
