> Memory usage was a huge problem,
>
I had similar issues when running code on our clusters. I think the software
and hardware architecture may also be a factor to consider. What I observed
was that memory usage grew over time; adding a few lines to manually trigger
garbage collection solved the issue for me.
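For reference, the workaround looked roughly like the sketch below (the estimator and helper here are illustrative stand-ins, not the actual cluster code): explicitly drop large intermediates and force a collection between fits so resident memory stays flat.

```python
import gc

class LeakyEstimator:
    """Illustrative stand-in for an estimator whose fit allocates a lot."""
    def fit(self, X, y):
        self._scratch = [bytearray(10000) for _ in range(100)]  # large intermediate
        return self
    def score(self, X, y):
        return 1.0

def fit_all(estimators, X, y):
    scores = []
    for est in estimators:
        scores.append(est.fit(X, y).score(X, y))
        del est._scratch  # drop large intermediates explicitly
        gc.collect()      # force a full collection so memory does not grow across fits
    return scores
```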
> "PBS: job killed: ncpus 19.73 exceeded limit 8 (sum)"
>
Hm, I am not sure where this is coming from: 1 for the main process, 6 for the
parallel jobs, and 12 pre-dispatched jobs adds up to 19, which matches the
reported 19.73. I guess the pre-dispatching may be the problem here?
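One other thing worth ruling out (an assumption on my part, not something confirmed in this thread): on clusters, NumPy's BLAS backend can spawn its own threads inside each worker process, which would push the observed CPU usage well past n_jobs. Pinning the per-process thread counts before launching the job would rule that out; the variable names below are the standard OpenMP/MKL/OpenBLAS ones.

```shell
# Pin each worker process to a single BLAS/OpenMP thread so total CPU
# usage stays at roughly 1 (main) + n_jobs (workers).
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
```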
> I've tried playing around the pre_dispatch but it makes difference.
>
>
Have you tried running it without pre_dispatch at all? Also, I am wondering if
you intended to write that it makes “no” difference?
Best,
Sebastian
> On Sep 24, 2015, at 9:41 AM, Joel Nothman <joel.noth...@gmail.com> wrote:
>
> In terms of memory: I gather joblib.parallel is meant to automatically memmap
> large arrays (>100MB). However, then each subprocess will extract a
> non-contiguous set of samples from the data for training under a
> cross-validation regime. Would I be right in thinking that's where the memory
> blowout comes from? When there's risk of such an expensive indexing, should
> we be using sample_weight (where the base estimator supports it) to select
> portions of the training data without copy?
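To make the copy-vs-weight contrast concrete, here is a small sketch (Ridge is used as a stand-in base estimator; the arrays and split are made up): zeroing sample_weight on the held-out rows gives the same fit as training on a fancy-indexed copy, without materializing that copy of the (possibly memmapped) data.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.rand(200, 5)
y = rng.rand(200)
train_idx = np.arange(150)  # a made-up CV training split

# Copying route: fancy indexing materializes a fresh array per split.
est_copy = Ridge().fit(X[train_idx], y[train_idx])

# Weighting route: keep the full array intact and zero out the held-out
# samples instead; no per-split copy is created.
w = np.zeros(len(y))
w[train_idx] = 1.0
est_weighted = Ridge().fit(X, y, sample_weight=w)

# Both routes fit the same model.
assert np.allclose(est_copy.coef_, est_weighted.coef_)
```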
>
> On 24 September 2015 at 23:21, Dale Smith <dsm...@nexidia.com> wrote:
> My experiences with parallel GridSearchCV and RFECV have not been pleasant.
> Memory usage was a huge problem, as apparently each job got a copy of the
> data with an out-of-the-box scikit-learn installation using Anaconda 3. No
> matter how I set pre_dispatch, I could not get n_jobs = 2 to work, even with
> no one else using a 100 GB, 24-core Windows box.
>
> I can create some reproducible code if anyone has time to work on it.
>
> Dale Smith, Ph.D.
> Data Scientist
>
> http://nexidia.com/
>
> d. 404.495.7220 x 4008 f. 404.795.7221
> Nexidia Corporate | 3565 Piedmont Road, Building Two, Suite 400 | Atlanta, GA
> 30305
>
>
>
> From: Clyde Fare [mailto:clyde.f...@gmail.com]
> Sent: Thursday, September 24, 2015 8:38 AM
> To: scikit-learn-general@lists.sourceforge.net
> Subject: [Scikit-learn-general] GridSearchCV using too many cores?
>
>
>
> Hi,
>
>
>
> I'm trying to run GridSearchCV on a computational cluster but my jobs keep
> failing with an error from the queuing system claiming I'm using too many
> cores.
>
>
>
> If I set n_jobs equal to 1, the job doesn't fail; but if it's more than one,
> no matter what number it is, the job fails.
>
>
>
> In the example below I've set n_jobs to 6 and pre_dispatch to 12, and asked
> for 8 processors from the queue. I got the following error after ~10 minutes:
> "PBS: job killed: ncpus 19.73 exceeded limit 8 (sum)"
>
>
>
> I've tried playing around the pre_dispatch but it makes difference. There
> will be other people running calculations on these nodes, so might there be
> some kind of interference between GridSearchCV and the other jobs?
>
>
>
> Anyone come across anything like this before?
>
>
>
> Cheers
>
>
>
> Clyde
>
>
>
>
>
> import dill
> import numpy as np
>
> from sklearn.kernel_ridge import KernelRidge
> from sklearn.grid_search import GridSearchCV
>
> label = 'test_grdsrch3'
> X_train = np.random.rand(971, 276)
> y_train = np.random.rand(971)
>
> kr = GridSearchCV(KernelRidge(), cv=10,
>                   param_grid={"kernel": ['rbf', 'laplacian'],
>                               "alpha": [2**i for i in np.arange(-40, -5, 0.5)],
>                               "gamma": [1/(2.**(2*i)) for i in np.arange(5, 18, 0.5)]},  # gamma = 1/sigma^2
>                   pre_dispatch=12,
>                   n_jobs=6)
>
> kr.fit(X_train, y_train)
>
> with open(label + '.pkl', 'wb') as data_f:  # binary mode so dill can pickle
>     dill.dump(kr, data_f)
>
>
>
>
> ------------------------------------------------------------------------------
> Monitor Your Dynamic Infrastructure at Any Scale With Datadog!
> Get real-time metrics from all of your servers, apps and tools
> in one place.
> SourceForge users - Click here to start your Free Trial of Datadog now!
> http://pubads.g.doubleclick.net/gampad/clk?id=241902991&iu=/4140
> _______________________________________________
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
>