2012/1/23 Dimitrios Pritsos <[email protected]>:
>
> However, when I do the same test using partial_fit() for the same
> sub-set of my Data Set (see above) I am getting ~20%.
>
> Any Suggestions?

Run a grid search to find the best value of alpha for SGDClassifier
(and of C for the LinearSVC classifier). For instance:

>>> from sklearn.grid_search import GridSearchCV
>>> from sklearn.linear_model import SGDClassifier
>>> from sklearn.datasets import fetch_20newsgroups_vectorized
>>> twenty = fetch_20newsgroups_vectorized()

>>> param_grid = {'alpha': [1e-3, 1e-4, 1e-5]}
>>> gs = GridSearchCV(SGDClassifier(), param_grid).fit(twenty.data,
...                                                     twenty.target)

>>> gs.best_estimator_
SGDClassifier(alpha=0.0001, class_weight=None, eta0=0.0, fit_intercept=True,
       learning_rate='optimal', loss='hinge', n_iter=5, n_jobs=1,
       penalty='l2', power_t=0.5, rho=0.85, seed=0, shuffle=False,
       verbose=0, warm_start=False)
>>> gs.best_score_
0.8575220898001239

You can also include 'n_iter': [5, 10, 50] and 'class_weight':
['auto', None] in the param_grid, but beware of the combinatorial
explosion in computation time: every added parameter multiplies the
number of models to fit.
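
To get a feel for how quickly the grid grows, here is a rough sketch that
only counts the combinations with the standard library; it does not call
scikit-learn at all. The helper count_candidates and the value n_folds = 3
are my own assumptions for illustration, so plug in whatever
cross-validation setup you actually use:

```python
from itertools import product

# The alpha grid from the example above plus the two optional parameters.
param_grid = {
    'alpha': [1e-3, 1e-4, 1e-5],
    'n_iter': [5, 10, 50],
    'class_weight': ['auto', None],
}


def count_candidates(grid):
    """Count the parameter combinations an exhaustive grid search tries."""
    return sum(1 for _ in product(*grid.values()))


n_candidates = count_candidates(param_grid)

# Assumed number of cross-validation folds (hypothetical, adjust as needed).
n_folds = 3

# Each candidate is fitted once per fold.
n_fits = n_candidates * n_folds

print(n_candidates, n_fits)
```

With the alpha grid alone there are 3 candidates; adding the two extra
parameters already gives 3 * 3 * 2 = 18, each of them fitted once per fold.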

Don't worry about partial_fit: your data will fit in memory in the
sparse CSR format, so you can call fit() on the whole subset at once.
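
If you want to check this for your own corpus, here is a back-of-the-envelope
sketch. It assumes 8-byte float values and 4-byte integer indices, which is
a common CSR layout but not guaranteed for your matrix, and the corpus
dimensions below are made-up numbers, not your actual data set:

```python
def csr_bytes(n_rows, nnz, value_bytes=8, index_bytes=4):
    """Approximate CSR size: data and indices arrays hold one entry per
    non-zero, indptr holds n_rows + 1 offsets."""
    return nnz * (value_bytes + index_bytes) + (n_rows + 1) * index_bytes


def dense_bytes(n_rows, n_cols, value_bytes=8):
    """Size of the same matrix stored as a dense array."""
    return n_rows * n_cols * value_bytes


# Hypothetical corpus: 20000 documents, 100000 features,
# about 150 non-zero features per document.
n_docs, n_features, nnz_per_doc = 20000, 100000, 150
nnz = n_docs * nnz_per_doc

print("CSR:   %.1f MB" % (csr_bytes(n_docs, nnz) / 1e6))
print("dense: %.1f MB" % (dense_bytes(n_docs, n_features) / 1e6))
```

For text data the number of non-zeros per document is tiny compared to the
vocabulary size, which is why the CSR estimate stays small while the dense
one does not.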

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
