> Memory usage was a huge problem,
>
I had similar issues when running code on our clusters. I think the software
and hardware architecture may also be a factor to consider. What I observed
was that memory usage grew over time; adding a few lines to manually trigger
garbage collection solved the issue for me.
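For reference, the workaround looked roughly like the sketch below (the estimator and helper here are illustrative stand-ins, not the actual cluster code): explicitly drop large intermediates and force a collection between fits so resident memory stays flat.

```python
import gc

class LeakyEstimator:
    """Illustrative stand-in for an estimator whose fit allocates a lot."""
    def fit(self, X, y):
        self._scratch = [bytearray(10000) for _ in range(100)]  # large intermediate
        return self
    def score(self, X, y):
        return 1.0

def fit_all(estimators, X, y):
    scores = []
    for est in estimators:
        scores.append(est.fit(X, y).score(X, y))
        del est._scratch  # drop large intermediates explicitly
        gc.collect()      # force a full collection so memory does not grow across fits
    return scores
```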
> "PBS: job killed: ncpus 19.73 exceeded limit 8 (sum)"
>
Hm, I am not sure where this is coming from: 1 for the main process, 6 for the
parallel jobs, and 12 pre-dispatched jobs adds up to 19, which matches the
reported 19.73. I guess the pre-dispatching may be the problem here?
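One other thing worth ruling out (an assumption on my part, not something confirmed in this thread): on clusters, NumPy's BLAS backend can spawn its own threads inside each worker process, which would push the observed CPU usage well past n_jobs. Pinning the per-process thread counts before launching the job would rule that out; the variable names below are the standard OpenMP/MKL/OpenBLAS ones.

```shell
# Pin each worker process to a single BLAS/OpenMP thread so total CPU
# usage stays at roughly 1 (main) + n_jobs (workers).
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
```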
> I've tried playing around the pre_dispatch but it makes difference.
>
>
Have you tried running it without pre_dispatch at all? Also, I am wondering if
you intended to write that it makes “no” difference?
Best,
Sebastian
> On Sep 24, 2015, at 9:41 AM, Joel Nothman <joel.noth...@gmail.com> wrote:
>
> In terms of memory: I gather joblib.parallel is meant to automatically memmap
> large arrays (>100MB). However, then each subprocess will extract a
> non-contiguous set of samples from the data for training under a
> cross-validation regime. Would I be right in thinking that's where the memory
> blowout comes from? When there's risk of such an expensive indexing, should
> we be using sample_weight (where the base estimator supports it) to select
> portions of the training data without copy?
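To make the copy-vs-weight contrast concrete, here is a small sketch (Ridge is used as a stand-in base estimator; the arrays and split are made up): zeroing sample_weight on the held-out rows gives the same fit as training on a fancy-indexed copy, without materializing that copy of the (possibly memmapped) data.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.rand(200, 5)
y = rng.rand(200)
train_idx = np.arange(150)  # a made-up CV training split

# Copying route: fancy indexing materializes a fresh array per split.
est_copy = Ridge().fit(X[train_idx], y[train_idx])

# Weighting route: keep the full array intact and zero out the held-out
# samples instead; no per-split copy is created.
w = np.zeros(len(y))
w[train_idx] = 1.0
est_weighted = Ridge().fit(X, y, sample_weight=w)

# Both routes fit the same model.
assert np.allclose(est_copy.coef_, est_weighted.coef_)
```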
>
> On 24 September 2015 at 23:21, Dale Smith <dsm...@nexidia.com> wrote:
> My experiences with parallel GridSearchCV and RFECV have not been pleasant.
> Memory usage was a huge problem, as apparently each job got a copy of the
> data with an out-of-the-box scikit-learn installation using Anaconda 3. No
> matter how I set pre_dispatch, I could not get n_jobs = 2 to work, even with
> no one else using a 100 GB, 24-core Windows box.
>
> I can create some reproducible code if anyone has time to work on it.
>
> Dale Smith, Ph.D.
> Data Scientist
>
> http://nexidia.com/
>
> d. 404.495.7220 x 4008 f. 404.795.7221
> Nexidia Corporate | 3565 Piedmont Road, Building Two, Suite 400 | Atlanta, GA
> 30305
>
>
>
> From: Clyde Fare [mailto:clyde.f...@gmail.com]
> Sent: Thursday, September 24, 2015 8:38 AM
> To: scikit-learn-general@lists.sourceforge.net
> Subject: [Scikit-learn-general] GridSearchCV using too many cores?
>
>
>
> Hi,
>
>
>
> I'm trying to run GridSearchCV on a computational cluster but my jobs keep
> failing with an error from the queuing system claiming I'm using too many
> cores.
>
>
>
> If I set n_jobs equal to 1, the job doesn't fail; but if it's more than one,
> no matter what number it is, the job fails.
>
>
>
> In the example below I've set n_jobs to 6 and pre_dispatch to 12, and asked
> for 8 processors from the queue. I got the following error after ~10 minutes:
> "PBS: job killed: ncpus 19.73 exceeded limit 8 (sum)"
>
>
>
> I've tried playing around the pre_dispatch but it makes difference. There
> will be other people running calculations on these nodes, so might there be
> some kind of interference between GridSearchCV and the other jobs?
>
>
>
> Anyone come across anything like this before?
>
>
>
> Cheers
>
>
>
> Clyde
>
>
>
>
>
> import dill
> import numpy as np
>
> from sklearn.kernel_ridge import KernelRidge
> from sklearn.grid_search import GridSearchCV
>
> label = 'test_grdsrch3'
> X_train = np.random.rand(971, 276)
> y_train = np.random.rand(971)
>
> kr = GridSearchCV(KernelRidge(), cv=10,
>                   param_grid={"kernel": ['rbf', 'laplacian'],
>                               "alpha": [2**i for i in np.arange(-40, -5, 0.5)],
>                               "gamma": [1/(2.**(2*i)) for i in np.arange(5, 18, 0.5)]},  # gamma = 1/sigma^2
>                   pre_dispatch=12,
>                   n_jobs=6)
>
> kr.fit(X_train, y_train)
>
> with open(label + '.pkl', 'wb') as data_f:  # binary mode so dill can pickle
>     dill.dump(kr, data_f)
>
>
>
>
> ------------------------------------------------------------------------------
> Monitor Your Dynamic Infrastructure at Any Scale With Datadog!
> Get real-time metrics from all of your servers, apps and tools
> in one place.
> SourceForge users - Click here to start your Free Trial of Datadog now!
> http://pubads.g.doubleclick.net/gampad/clk?id=241902991&iu=/4140
> _______________________________________________
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
>