Dear Riccardo,

Many thanks for your extensive answer! Regarding your suggestions:

#1. This is indeed a quick and interesting solution. In fact, given my
little knowledge of parallel computing, I was not aware that I could
achieve the same parallelization with the two setups. Just to make sure
I understand, let's suppose I would like to run 20 R sessions in
parallel, where each session uses 4 CPUs. Is it possible to do so even
if I have a cluster made of 4 nodes with 20 cores each?
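For concreteness, here is the kind of SLURM submission I have in mind
(just a sketch -- `my_analysis.R` is a made-up script name, and I am
assuming the scheduler is free to pack five 4-core tasks onto each
20-core node):

    #!/bin/bash
    #SBATCH --array=1-20        # 20 independent R sessions
    #SBATCH --ntasks=1          # one R process per array task
    #SBATCH --cpus-per-task=4   # each session gets 4 cores
    # hypothetical script; R itself must be told to use its 4 cores,
    # e.g. via parallel::mclapply(..., mc.cores = 4)
    Rscript my_analysis.R
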
#2--4. Thanks for these tips, I will try them out and I will let you
know how they work :-)

Thanks again for all your help!
Nicola

On Mon, Oct 5, 2020 at 3:16 PM Riccardo Murri <[email protected]> wrote:
> Hello Nicola,
>
> thanks for your question! I realize this is an important issue for
> ElastiCluster (perhaps *the* single most important issue [1]), but I
> have done nothing to document workarounds properly. I welcome
> suggestions on where/how to mention this in the publicly available
> documentation!
>
> First of all: ElastiCluster depends critically on Ansible, and Ansible
> is slow. So there is not much that can be done to *radically* speed up
> cluster configuration; it's a design limitation of ElastiCluster.
>
> That said, there are a few mitigations that can be applied:
>
> #1. Use larger nodes
>
> Given the configuration you posted, I presume you're following
> Google's "Running R at scale" guide; that guide sets up a cluster for
> spreading single-threaded R functions across a set of compute cores.
> In this case, you are interested in the total number of *cores* that
> the cluster provides, not so much in their distribution across nodes
> (as would be the case, e.g., if you were running a hybrid MPI/OpenMP
> application).
>
> So here's the trick: use fewer, larger nodes! Since installation and
> configuration are repeated per node, 4 nodes with 20 cores each will
> be configured ~5x faster than 20 nodes with 4 cores each. For example:
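> (A sketch only: the section names follow the config file you posted,
> and `n1-standard-16` is just a placeholder -- pick whatever large
> machine type your provider offers.)
>
>     [cluster/myslurmcluster]
>     # ... rest of config file as-is
>     frontend_nodes=1
>     compute_nodes=4           # fewer nodes ...
>
>     [cluster/myslurmcluster/compute]
>     flavor=n1-standard-16     # ... each with more cores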
>
> #2. Use snapshots
>
> This will help if you plan on deleting and re-creating clusters with
> the same set of installed software over time; for instance, if you
> need to spin up R clusters 4+ times over the course of a few months,
> or if you are going to install a large cluster (say, >50 nodes). It
> will *not* help with one-off small cluster setups.
>
> What takes a lot of time is the initial installation and configuration
> of software, which has to be repeated for each node. The idea here is
> to do this once, snapshot the running cluster, and use it as a base
> for building other clusters.
>
> The procedure requires a bit of manual intervention:
>
> - Start a cluster with the exact configuration you want, but only 1
>   frontend node and 1 compute node.
> - Power off both nodes and create disk images from them (instructions
>   for Google Cloud at [3], but all cloud providers have similar
>   functionality).
> - Change your config file to use the snapshotted disks as `image_id`
>   instead of the pristine OS; note you will need different snapshots /
>   disk images for frontend and compute nodes. So your config file will
>   be something like:
>
>     # ... rest of config file as-is
>
>     [cluster/myslurmcluster/frontend]
>     image_id=my-frontend-disk-image
>     # ... rest of config file as-is
>
>     [cluster/myslurmcluster/compute]
>     image_id=my-compute-disk-image
>     # ... rest of config file as-is
>
> - Change the number of nodes to match the intended usage and start the
>   real cluster.
>
> #3. Use `mitogen`
>
> Install the `mitogen` Python package following the instructions at
> https://mitogen.networkgenomics.com/ansible_detailed.html
>
> (Note: this won't work if you are using the Dockerized version of
> ElastiCluster, aka `elasticluster.sh`.)
>
> If `mitogen` is present, ElastiCluster will use it automatically to
> speed up Ansible's SSH connections; benefits are evident especially in
> conjunction with #2.
>
> #4. Adjust ANSIBLE_FORKS
>
> Contrary to what Google's online article states, picking a fixed
> `ansible_forks=` value isn't the best option. The optimal number
> depends on the number of CPU cores and the network bandwidth *of the
> control machine* (i.e., the one where you're running the
> `elasticluster` command). It does *not* depend on the size of the
> cluster being built.
>
> It takes a bit of experimentation to find the optimal number: I
> normally start at 4x the number of local CPU cores, keep an eye on CPU
> and network utilization, and then adjust (down if you see CPU or
> network being saturated, up if they're not). For example:
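> (A sketch, assuming Ansible's standard `ANSIBLE_FORKS` environment
> variable is picked up when ElastiCluster invokes Ansible; the 4x
> multiplier is just my usual starting point.)
>
>     # start at 4x the control machine's core count, then tune up/down
>     # while watching CPU and network load on this machine
>     export ANSIBLE_FORKS=$(( 4 * $(nproc) ))
>     elasticluster start myslurmcluster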
>
> Please let me know if this is clear enough, and, above all, if it
> helps :-)
>
> Ciao,
> R
>
> [1]: https://github.com/elasticluster/elasticluster/issues/365
> [2]: https://github.com/elasticluster/elasticluster/blob/master/docs/presentations/hepix2017/slides.pdf
> [3]: https://cloud.google.com/compute/docs/images/create-delete-deprecate-private-images#create_image