2012/10/8 Michael Becker <[email protected]>:
> Any other recommendations would be appreciated. I can see I'm not the only
> one to experience these kinds of issues with GridSearchCV. Ideally I would
> like to be able to specify many more parameters to test without experiencing
> a MemoryError or excessive swap utilization.

Using TfidfVectorizer instead of a separate CountVectorizer and
TfidfTransformer should prevent the data from being copied, which might help.

Then, using SGDClassifier instead of LinearSVC would probably help a lot. By
default it fits a linear SVM as well, though with a different algorithm, and
it doesn't copy its entire X matrix. You'll have to reformulate the C
parameter in terms of alpha, which I believe is just 1/C (so search over
alpha in [1, .1, .01, .001]).

Also, there's a proposed fix to CountVectorizer at
https://github.com/scikit-learn/scikit-learn/pull/1135
which promises a six-fold reduction in memory usage, but unfortunately we
haven't got round to merging it yet.

-- 
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
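
A minimal sketch of the setup suggested in the message, assuming a recent scikit-learn in which GridSearchCV lives in sklearn.model_selection (older releases had it in sklearn.grid_search). The helper names alphas_from_C and build_search are hypothetical and only for illustration. One caveat on the C-to-alpha mapping: as far as I can tell, SGDClassifier averages the loss over samples while LinearSVC sums it, so this sketch assumes alpha = 1 / (C * n_samples) rather than plain 1/C. Treat that scaling as an assumption to check against the documentation for your version.

```python
def alphas_from_C(C_values, n_samples):
    """Map LinearSVC-style C values to SGDClassifier alpha values.

    Assumption: alpha = 1 / (C * n_samples), because SGDClassifier
    averages the loss over samples while LinearSVC sums it.
    """
    return [1.0 / (C * n_samples) for C in C_values]


def build_search(alphas, cv=3):
    """Grid search over alpha for a TF-IDF + linear SVM (SGD) pipeline."""
    # Imported lazily so alphas_from_C works without scikit-learn installed.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import SGDClassifier
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline

    pipeline = Pipeline([
        # One step instead of CountVectorizer followed by TfidfTransformer.
        ("tfidf", TfidfVectorizer()),
        # loss="hinge" with an L2 penalty gives a linear SVM trained by SGD.
        ("clf", SGDClassifier(loss="hinge", penalty="l2")),
    ])
    # n_jobs=1 avoids running several fits in parallel worker processes,
    # which would each need memory of their own.
    return GridSearchCV(pipeline, {"clf__alpha": list(alphas)}, cv=cv, n_jobs=1)
```

With a list of raw text documents docs and their labels, usage would look like search = build_search(alphas_from_C([1, 10, 100, 1000], len(docs))) followed by search.fit(docs, labels). Note that LinearSVC defaults to the squared hinge loss, so the two models are close but not identical even with matched regularization.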
