Hi Jian,

1. Your pipeline probably has other sources of non-determinism: SVC, for
example, also has a random_state parameter. You should set random_state
explicitly on every randomised step of your pipeline (see the first sketch
below).

2. Yes and yes. As far as I know, your best bet is to split them randomly
(second sketch below).
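
For point 1, a minimal sketch of what I mean, reusing the names from your
snippet (only the explicit random_state arguments are new; I'm also guessing
from the clf__bootstrap / clf__n_estimators keys in your output that one of
your variants uses a RandomForestClassifier):

    from sklearn.svm import SVC
    from sklearn.ensemble import RandomForestClassifier
    from sklearn import cross_validation

    # Fix the seed of every randomised component, not only the CV splitter.
    svm = SVC(C=1, random_state=0)                # SVC takes a random_state too
    clf = RandomForestClassifier(random_state=0)  # seeds the bootstrap sampling
    cv = cross_validation.ShuffleSplit(n_samples, n_iter=3, test_size=0.3,
                                       random_state=0)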
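
For point 2, one way to do it (a sketch only, again reusing pipeline and
param_grid from your snippet; X_extra / y_extra would simply be the held-out
part that train_test_split gives you):

    from sklearn import cross_validation
    from sklearn.grid_search import GridSearchCV

    # Hold out part of the data once, tune on the rest,
    # then evaluate on the held-out part.
    X_train, X_test, y_train, y_test = cross_validation.train_test_split(
        X, y, test_size=0.25, random_state=0)
    cv = cross_validation.ShuffleSplit(len(X_train), n_iter=3, test_size=0.3,
                                       random_state=0)
    grid_search = GridSearchCV(pipeline, param_grid=param_grid,
                               scoring='roc_auc', cv=cv, refit=True, n_jobs=-1)
    grid_search.fit(X_train, y_train)
    print grid_search.score(X_test, y_test)  # scored with the 'roc_auc' scorer

Since the split itself is random, you can repeat it with a few different
random_state values and look at the spread of the held-out scores to get a
feel for how much the split matters.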

Best regards,

José Ricardo


On Thu, Dec 12, 2013 at 6:39 PM, Su, Jian, Ph.D. <su.j...@mayo.edu> wrote:

>
>  Hello,
>
>  I am using a pipeline and grid search to find the best hyperparameters, as
> in the code at the end of this post.
>
>  Here are two questions:
> 1. Even though I set random_state=0, the results are not the same every
> time. How can I find the "truth"?
> 0.867933723197 {'clf__bootstrap': False, 'clf__max_depth': 10,
>   'features__univ_select__k': 6, 'clf__n_estimators': 14,
>   'features__pca__n_components': 3}
> 0.888569974774 {'clf__bootstrap': True, 'clf__max_depth': 9,
>   'features__univ_select__k': 5, 'clf__n_estimators': 13,
>   'features__pca__n_components': 3}
> 0.885452499713 {'clf__bootstrap': True, 'clf__max_depth': 7,
>   'features__univ_select__k': 6, 'clf__n_estimators': 13,
>   'features__pca__n_components': 3}
>
>  2. To evaluate the classifier I should use a separate dataset other than
> X, right?
> grid_search.predict(X_extra, y_extra)
> If that's the way, then since (X + X_extra) is actually all the data I have,
> the way I separate X and X_extra will affect the evaluation, right?
>
>  Thank you,
> Jian
>
>
>  >>>>>>>>>>>>>>>>>>>>
> # Imports assumed for the snippet below (scikit-learn 0.14-era API):
> import numpy as np
> from sklearn import preprocessing, cross_validation
> from sklearn.decomposition import PCA
> from sklearn.feature_selection import SelectKBest
> from sklearn.pipeline import Pipeline, FeatureUnion
> from sklearn.svm import SVC
> from sklearn.grid_search import GridSearchCV
>
> # X, y: the data (defined elsewhere)
> X = preprocessing.scale(X)
> n_samples, n_features = np.shape(X)
>
> # Combine PCA components with univariate feature selection
> pca = PCA(n_components=2)
> selection = SelectKBest(k=3)
> combined_features = FeatureUnion([("pca", pca), ("univ_select", selection)])
>
> svm = SVC(C=1)
> pipeline = Pipeline([("features", combined_features), ("svm", svm)])
> param_grid = dict(features__pca__n_components=[1, 2, 3, 4, 5],
>                   features__univ_select__k=[1, 2, 3, 4, 5, 6],
>                   svm__C=[0.1, 0.3, 1, 3, 10, 30],
>                   svm__gamma=[0.01, 0.03, 0.1, 0.3, 1],
>                   svm__kernel=['rbf', 'linear'])
>
> cv = cross_validation.ShuffleSplit(n_samples, n_iter=3, test_size=0.3,
>                                    random_state=0)
> grid_search = GridSearchCV(pipeline, param_grid=param_grid,
>                            scoring='roc_auc', cv=cv, refit=True, n_jobs=-1)
> grid_search.fit(X, y)
> print grid_search.best_score_, grid_search.best_params_
>
>