2012/10/8 Michael Becker <[email protected]>:
> Any other recommendations would be appreciated. I can see I'm not the only
> one to experience these kinds of issues with GridSearchCV. Ideally I would
> like to be able to specify many more parameters to test without experiencing
> a MemoryError or excessive swap utilization.

Using TfidfVectorizer instead of separate CountVectorizer and
TfidfTransformer should prevent data being copied. That might help.
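As a minimal sketch of that swap (the toy documents and variable names are made up for illustration, and default settings are assumed on both sides), the single TfidfVectorizer step should give the same matrix as the two-step pipeline:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)
from sklearn.pipeline import Pipeline

# Toy corpus, purely illustrative.
docs = [
    "grid search runs out of memory",
    "memory usage grows with the grid",
    "linear models scale to sparse text",
    "sparse text features are high dimensional",
]

# Two separate steps: raw counts first, then tf-idf reweighting.
two_step = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
])

# One combined step doing both.
one_step = TfidfVectorizer()

X_two = two_step.fit_transform(docs)
X_one = one_step.fit_transform(docs)

# With default parameters the two routes should agree numerically.
same = np.allclose(X_two.toarray(), X_one.toarray())
print(same)
```

In a grid search, the parameters previously split between "vect__..." and "tfidf__..." would then all hang off the single vectorizer step.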

Then, using SGDClassifier instead of LinearSVC would probably help a
lot; by default, it fits a linear SVM as well, though by a different
algorithm, and it doesn't copy its entire X matrix. You'll have to
reformulate the C parameter in terms of alpha, which I believe is just
1/C (so search over alpha in [1, .1, .01, .001]).
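A rough sketch of what that search could look like. The toy data, step names and cv=2 are illustrative assumptions, and the import path assumes a recent scikit-learn where GridSearchCV lives in sklearn.model_selection. The alpha grid is the one suggested above; since the exact alpha/C correspondence may also involve the number of samples depending on how each objective is scaled, treat the grid as a starting point rather than an exact translation of C.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy two-class corpus, purely illustrative.
docs = [
    "cheap pills buy now", "win money fast now",
    "buy cheap money offer", "fast offer win pills",
    "meeting agenda for monday", "project report due friday",
    "monday project meeting notes", "friday agenda and report",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    # loss="hinge" is the default and gives a linear SVM objective.
    ("clf", SGDClassifier(loss="hinge", random_state=0)),
])

# Search over alpha instead of C, using the grid from the message.
alphas = [1.0, 0.1, 0.01, 0.001]
param_grid = {"clf__alpha": alphas}

search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(docs, labels)

print(search.best_params_)
```

Any vectorizer parameters you want to test go into the same param_grid under the "tfidf__" prefix.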

Also, there's a proposed fix to CountVectorizer at
https://github.com/scikit-learn/scikit-learn/pull/1135 which promises
a six-fold reduction in memory usage, but unfortunately we haven't got
round to merging it yet.

-- 
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
