2012/1/23 Dimitrios Pritsos <[email protected]>:
>
> However, when I do the same test using partial_fit() for the same
> sub-set of my Data Set (see above) I am getting ~20%.
>
> Any Suggestions?
Do a grid search to find the best alpha for SGDClassifier (and the best C for
the LinearSVC classifier). For instance:
>>> from sklearn.grid_search import GridSearchCV
>>> from sklearn.linear_model import SGDClassifier
>>> from sklearn.datasets import fetch_20newsgroups_vectorized
>>> twenty = fetch_20newsgroups_vectorized()
>>> param_grid = {'alpha': [1e-3, 1e-4, 1e-5]}
>>> gs = GridSearchCV(SGDClassifier(), param_grid).fit(twenty.data,
...                                                    twenty.target)
>>> gs.best_estimator_
SGDClassifier(alpha=0.0001, class_weight=None, eta0=0.0, fit_intercept=True,
       learning_rate='optimal', loss='hinge', n_iter=5, n_jobs=1,
       penalty='l2', power_t=0.5, rho=0.85, seed=0, shuffle=False,
       verbose=0, warm_start=False)
>>> gs.best_score_
0.8575220898001239
You can also include 'n_iter': [5, 10, 50] and 'class_weight':
['auto', None] in the param_grid but beware of the combinatorial
explosion in computation time.
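To get a feel for how fast the grid grows, here is a rough sketch that
only counts the fits and does not train anything. The helper name
count_fits and the 3-fold cross-validation figure are my own assumptions
for illustration; the final refit on the full data is not counted.

```python
from itertools import product


def count_fits(param_grid, n_folds=3):
    """Return how many model fits an exhaustive grid search would need.

    n_folds=3 is an assumed number of cross-validation folds.
    """
    n_candidates = 1
    for values in param_grid.values():
        n_candidates *= len(values)
    return n_candidates * n_folds


# The grid from the session above: 3 candidates.
small_grid = {'alpha': [1e-3, 1e-4, 1e-5]}

# The extended grid: 3 * 3 * 2 = 18 candidates.
full_grid = {
    'alpha': [1e-3, 1e-4, 1e-5],
    'n_iter': [5, 10, 50],
    'class_weight': ['auto', None],
}

# Enumerate the candidates explicitly as a cross-check on the count.
candidates = list(product(*full_grid.values()))

print(count_fits(small_grid))  # 9
print(count_fits(full_grid))   # 54
```

So adding the two extra parameters takes the search from 9 fits to 54
under these assumptions, and the n_iter=50 candidates are each roughly
ten times more expensive than the n_iter=5 ones.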
Don't worry about partial_fit as your data will fit in memory with the
CSR format.
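If you want to check this for your own data, a back-of-the-envelope
estimate of the CSR footprint is easy to compute. The sketch below
assumes 8-byte float values and 4-byte integer indices, and the matrix
dimensions are made-up placeholders, not your actual data set.

```python
def csr_bytes(n_rows, nnz, value_bytes=8, index_bytes=4):
    """Approximate memory used by a CSR matrix.

    CSR stores one value and one column index per non-zero entry,
    plus a row-pointer array of length n_rows + 1.
    """
    data = nnz * value_bytes
    indices = nnz * index_bytes
    indptr = (n_rows + 1) * index_bytes
    return data + indices + indptr


def dense_bytes(n_rows, n_cols, value_bytes=8):
    """Memory used by the same matrix stored as a dense array."""
    return n_rows * n_cols * value_bytes


# Hypothetical shape: 20,000 documents, 100,000 features,
# about 100 non-zero features per document.
n_rows, n_cols, nnz = 20000, 100000, 2000000

print(csr_bytes(n_rows, nnz) / 1e6, "MB as CSR")
print(dense_bytes(n_rows, n_cols) / 1e9, "GB as dense")
```

With numbers in that range the sparse matrix takes tens of megabytes,
so there is no need for out-of-core learning with partial_fit.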
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general