2012/10/8 Michael Becker <[email protected]>:
> I'm noticing GridSearchCV is a huge memory hog when used in conjunction with
> the following pipeline:
> text_clf = Pipeline([
>     ('vect', CountVectorizer()),
>     ('tfidf', TfidfTransformer()),
>     ('clf', LinearSVC()),
> ])
> I'm using this with a custom dataset of roughly 60k text documents to do
> language classification on a machine with 196GB of memory, 400GB of swap (on
> a SSD), and 24 cores. If I try to use all 24 cores, I always get a
> MemoryError. I also notice that if I increase the number of parameters I
> pass into GridSearchCV, the amount of memory balloons as well. Right now I'm
> running it with the following parameters, and it's using up all the memory
> and 93% of the swap. Obviously all this swapping is impacting performance as
> well which is less than ideal:
> parameters = {
>     'vect__ngram_range': ((2, 3), (3, 3)),
>     'clf__C': (1, 10, 100, 1000),
> }
>
> https://github.com/scikit-learn/scikit-learn/issues/565 claims to fix an
> issue in which all the estimators are kept in memory. I thought the issue
> might be that I don't have these changes but I checked and in the version
> I'm using (0.11) these changes are there. Looking closer at the code, it
> seems like 2 parameters might be useful in lowering the amount of memory
> used. It appears that if I set refit=False, it won't keep the best estimator
> in memory, but I'm doubtful this will take up much memory. It seems much
> more likely that setting pre_dispatch to a lower value (I'm using the
> default of '2*n_jobs') is more likely to help with memory utilization.
>
> I suppose a 3rd option is to use a smaller dataset for the GridSearch,
> however I'm concerned this could present issues with overfitting.
>
> Any other recommendations would be appreciated. I can see I'm not the only
> one to experience these kinds of issues with GridSearchCV. Ideally I would
> like to be able to specify many more parameters to test without experiencing
> a MemoryError or excessive swap utilization.

The problem is well known: joblib forks worker processes, and the Python
GC (by touching reference counts) breaks the Unix copy-on-write
semantics, triggering useless memory copies of read-only data.

A fix is under development in joblib, but the initial solution [1] no
longer works with numpy dev (soon to be 1.7). I have found a solution
[2] to address the numpy behavioral change, but it requires further
work before it is in a mergeable state.

[1] https://github.com/joblib/joblib/pull/44
[2] https://github.com/ogrisel/joblib/tree/pickling-pool-base-collapsing
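In the meantime, the parameters you identified are the right knobs: a
minimal sketch of taming memory use with `pre_dispatch` and
`refit=False` (both real `GridSearchCV` parameters; the tiny toy
dataset and the smaller parameter grid here are just placeholders for
illustration):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV

text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', LinearSVC()),
])

parameters = {
    'vect__ngram_range': ((1, 1), (1, 2)),
    'clf__C': (1, 10),
}

# pre_dispatch='n_jobs' caps the number of (parameter, fold) tasks
# queued at once to the number of workers, instead of the default
# '2*n_jobs', so fewer copies of the data are materialized at a time.
# refit=False skips keeping a refitted best estimator in memory.
grid = GridSearchCV(text_clf, parameters, n_jobs=2,
                    pre_dispatch='n_jobs', refit=False)

# Toy language-identification data, standing in for the real 60k docs.
docs = ["the cat sat", "le chat est assis",
        "the dog ran", "le chien a couru"] * 5
labels = ["en", "fr", "en", "fr"] * 5
grid.fit(docs, labels)
print(grid.best_params_)
```

With a single scoring metric, `best_params_` and `best_score_` remain
available even with `refit=False`; only `best_estimator_` is dropped.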

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general