I don't think they're too fast; I tried with slower models and bigger data sets as well. I get the best results with n_jobs=20, which is the number of cores on a single node. Anything below that is considerably slower; anything above is mostly the same, sometimes a little slower.

Is there a way to see what each worker is running? Nothing is reported in the scheduler console window about the workers, just that there is a connection to the scheduler. Should something be reported about the work assigned to the workers?
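For what it's worth, the only client-side probing I know of would be something like the snippet below. This is a rough sketch, not something I have verified: I am assuming the Executor exposes scheduler_info() and processing() the way the distributed docs describe, and that the worker entries carry an 'ncores' field.

# Sketch (unverified): ask the scheduler what it knows about its workers.
from distributed import Executor

executor = Executor('my_scheduler:8786')

# Per-worker metadata (address, core count) as seen by the scheduler.
info = executor.scheduler_info()
for address, worker in info['workers'].items():
    print(address, worker.get('ncores'))

# Tasks currently assigned to each worker; if the backend were really
# distributing the fits, these should not all stay empty mid-search.
print(executor.processing())

If your version also serves the diagnostics web page (on port 8787 of the scheduler, I believe), that should show per-worker activity too.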
If I see speed benefits going from 1 to 20 n_jobs, surely there should be something noticeable above that as well if the distributed part were running correctly, no? This is a very easily parallelizable task, and my nodes are in a cluster on the same network, so I highly doubt it's (just) overhead. Is there anything else I could look into to try to fix this?

Fitting 10 folds for each of 10000 candidates, totalling 100000 fits
[Parallel(n_jobs=20)]: Done   10 tasks | elapsed:  0.7s
[Parallel(n_jobs=20)]: Done  160 tasks | elapsed:  4.8s
[Parallel(n_jobs=20)]: Done  410 tasks | elapsed: 12.6s
[Parallel(n_jobs=20)]: Done  760 tasks | elapsed: 23.7s
[Parallel(n_jobs=20)]: Done 1210 tasks | elapsed: 37.9s
[Parallel(n_jobs=20)]: Done 1760 tasks | elapsed: 55.0s
*[Parallel(n_jobs=20)]: Done 2410 tasks | elapsed: 1.2min*

---

Fitting 10 folds for each of 10000 candidates, totalling 100000 fits
[Parallel(n_jobs=4)]: Done  42 tasks | elapsed:  6.2s
[Parallel(n_jobs=4)]: Done 192 tasks | elapsed: 27.5s
[Parallel(n_jobs=4)]: Done 442 tasks | elapsed: 1.0min
*[Parallel(n_jobs=4)]: Done 792 tasks | elapsed: 1.7min*

---

Fitting 10 folds for each of 10000 candidates, totalling 100000 fits
[Parallel(n_jobs=100)]: Done  250 tasks | elapsed:  9.1s
[Parallel(n_jobs=100)]: Done  600 tasks | elapsed: 19.3s
[Parallel(n_jobs=100)]: Done 1050 tasks | elapsed: 34.0s
[Parallel(n_jobs=100)]: Done 1600 tasks | elapsed: 49.8s
*[Parallel(n_jobs=100)]: Done 2250 tasks | elapsed: 1.2min*

If 4 workers do 442 tasks in a minute, then 20 workers (5x as many) should ideally do 5 x 442 = 2210 tasks in that minute, and the n_jobs=20 run is right around that. So "double the workers, half the time" holds very well up to 20 workers, while the n_jobs=100 run is no faster at all. I have a hard time imagining that the scaling would stop holding at exactly the number of cores per node by coincidence.
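To make that arithmetic explicit, here is the last log line of each run side by side, as a throwaway script (only the unit conversion is mine: 1.2 min = 72 s, 1.7 min = 102 s):

# Last reported line of each run above: (n_jobs, tasks done, elapsed seconds).
runs = [(4, 792, 102), (20, 2410, 72), (100, 2250, 72)]

# Per-worker throughput measured in the n_jobs=4 run.
per_worker = 792.0 / 102 / 4

for n_jobs, tasks, secs in runs:
    actual = tasks / float(secs)
    ideal = per_worker * n_jobs  # what perfectly linear scaling predicts
    print('n_jobs=%3d: %6.1f tasks/s, ideal %6.1f' % (n_jobs, actual, ideal))

# Roughly: 7.8 vs 7.8 at n_jobs=4, 33.5 vs 38.8 at n_jobs=20,
# but 31.2 vs 194.1 at n_jobs=100. Scaling stops dead at 20.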
On Mon, Aug 8, 2016 at 8:25 AM Gael Varoquaux <gael.varoqu...@normalesup.org> wrote:

> My guess is that your model evaluations are too fast, and that you are
> not getting the benefits of distributed computing as the overhead is
> hiding them.
>
> Anyhow, I don't think that this is ready for prime-time usage. It
> probably requires tweaking and understanding the tradeoffs.
>
> G
>
> On Sun, Aug 07, 2016 at 09:25:47PM +0000, Vlad Ionescu wrote:
> > I copy-pasted the example in the link you gave, only made the search
> > take a longer time. I used dask-ssh to set up worker nodes and a
> > scheduler, then connected to the scheduler in my code.
> >
> > Tweaking the n_jobs parameter for the randomized search does not bring
> > any performance benefits. The connection to the scheduler seems to
> > work, but nothing gets assigned to the workers, because the code
> > doesn't scale.
> >
> > I am using scikit-learn 0.18.dev0
> >
> > Any ideas?
> >
> > Code and results are below. Only the n_jobs value was changed between
> > executions. I printed an Executor assigned to my scheduler, and it
> > reported 240 cores.
> >
> > import distributed.joblib
> > from joblib import Parallel, parallel_backend
> > from sklearn.datasets import load_digits
> > from sklearn.grid_search import RandomizedSearchCV
> > from sklearn.svm import SVC
> > import numpy as np
> >
> > digits = load_digits()
> >
> > param_space = {
> >     'C': np.logspace(-6, 6, 100),
> >     'gamma': np.logspace(-8, 8, 100),
> >     'tol': np.logspace(-4, -1, 100),
> >     'class_weight': [None, 'balanced'],
> > }
> >
> > model = SVC(kernel='rbf')
> > search = RandomizedSearchCV(model, param_space, cv=3, n_iter=1000,
> >                             verbose=1, n_jobs=200)
> >
> > with parallel_backend('distributed', scheduler_host='my_scheduler:8786'):
> >     search.fit(digits.data, digits.target)
> >
> > Fitting 3 folds for each of 1000 candidates, totalling 3000 fits
> > [Parallel(n_jobs=200)]: Done    4 tasks | elapsed:  0.5s
> > [Parallel(n_jobs=200)]: Done  292 tasks | elapsed:  6.9s
> > [Parallel(n_jobs=200)]: Done  800 tasks | elapsed: 16.1s
> > [Parallel(n_jobs=200)]: Done 1250 tasks | elapsed: 24.8s
> > [Parallel(n_jobs=200)]: Done 1800 tasks | elapsed: 36.0s
> > [Parallel(n_jobs=200)]: Done 2450 tasks | elapsed: 49.0s
> > [Parallel(n_jobs=200)]: Done 3000 out of 3000 | elapsed: 1.0min finished
> >
> > -------------------------------------
> >
> > Fitting 3 folds for each of 1000 candidates, totalling 3000 fits
> > [Parallel(n_jobs=20)]: Done   10 tasks | elapsed:  0.5s
> > [Parallel(n_jobs=20)]: Done  160 tasks | elapsed:  3.7s
> > [Parallel(n_jobs=20)]: Done  410 tasks | elapsed:  8.6s
> > [Parallel(n_jobs=20)]: Done  760 tasks | elapsed: 16.2s
> > [Parallel(n_jobs=20)]: Done 1210 tasks | elapsed: 25.0s
> > [Parallel(n_jobs=20)]: Done 1760 tasks | elapsed: 36.2s
> > [Parallel(n_jobs=20)]: Done 2410 tasks | elapsed: 48.8s
> > [Parallel(n_jobs=20)]: Done 3000 out of 3000 | elapsed: 1.0min finished
> >
> > On Sun, Aug 7, 2016 at 8:31 PM Gael Varoquaux
> > <gael.varoqu...@normalesup.org> wrote:
> >
> > > Parallel computing in scikit-learn is built upon joblib. In the
> > > development version of scikit-learn, the included joblib can be
> > > extended with a distributed backend:
> > > http://distributed.readthedocs.io/en/latest/joblib.html
> > > that can distribute code on a cluster.
> > >
> > > This is still bleeding edge, but this is probably a direction that
> > > will see more development.
>
> --
> Gael Varoquaux
> Researcher, INRIA Parietal
> NeuroSpin/CEA Saclay, Bat 145, 91191 Gif-sur-Yvette France
> Phone: ++ 33-1-69-08-79-68
> http://gael-varoquaux.info  http://twitter.com/GaelVaroquaux
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn