Hello Nicola, thanks for your question! I realize this is an important issue for ElastiCluster (perhaps *the* single most important issue [1]), but I have done nothing to document workarounds properly. I welcome suggestions on where/how to mention this in the publicly available documentation!
First of all: ElastiCluster depends critically on Ansible, and Ansible is slow. So there is not much that can be done to *radically* speed up cluster configuration; it's a design limitation of ElastiCluster. That said, there are a few mitigations that can be applied:

#1. Use larger nodes

Given the configuration you posted, I presume you're following Google's "Running R at scale" guide; that guide sets up a cluster for spreading single-threaded R functions across a set of compute cores. In this case, you are interested in the total number of *cores* that the cluster provides, not so much in their distribution across nodes (as would be the case, e.g., if you were running a hybrid MPI/OpenMP application). So here's the trick: use fewer, larger nodes! 4 nodes with 20 cores each will be configured ~5x faster than 20 nodes with 4 cores each. (A short config sketch is appended after #4 below.)

#2. Use snapshots

This will help if you plan on deleting and re-creating clusters with the same set of installed software over time; for instance, if you need to spin up R clusters 4+ times over the course of a few months, or if you are going to install a large cluster (say, >50 nodes). It will *not* help with one-off small cluster setups.

What takes a lot of time is the initial installation and configuration of software, which has to be repeated for each node. The idea here is to do this once, snapshot the running cluster, and use it as a base for building other clusters. The procedure requires a bit of manual intervention:

- Start a cluster with the exact configuration you want, but only 1 frontend node and 1 compute node.

- Power off both nodes and create disk images from them (instructions for Google Cloud at [3], but all cloud providers have similar functionality; a gcloud sketch is appended after #4 below).

- Change your config file to use the snapshotted disks as `image_id` instead of the pristine OS; note you will need different snapshots / disk images for frontend and compute nodes. So your config file will be something like:

      # ... rest of config file as-is
      [cluster/myslurmcluster/frontend]
      image_id=my-frontend-disk-image

      # ... rest of config file as-is
      [cluster/myslurmcluster/compute]
      image_id=my-compute-disk-image

      # ... rest of config file as-is

- Change the number of nodes to match the intended usage and start the real cluster.

#3. Use `mitogen`

Install the `mitogen` Python package following the instructions at https://mitogen.networkgenomics.com/ansible_detailed.html (note: this won't work if you are using the Dockerized version of ElastiCluster, aka `elasticluster.sh`). If `mitogen` is present, ElastiCluster will use it automatically to speed up Ansible's SSH connections; the benefits are especially evident in conjunction with #2.

#4. Adjust ANSIBLE_FORKS

Contrary to what Google's online article states, picking a fixed `ansible_forks=` value isn't the best option. The optimal number depends on the number of CPU cores and the network bandwidth *of the control machine* (i.e., the one where you're running the `elasticluster` command). It does *not* depend on the size of the cluster being built. It takes a bit of experimentation to find the optimal number; I normally start at 4x the number of local CPU cores, keep an eye on CPU and network utilization, and then adjust (down if CPU or network is saturated, up if they're not). (A small shell sketch is appended below.)
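Re #1, here is a minimal config sketch of the "fewer, larger nodes" idea. Treat it only as an illustration: the cluster name, flavor names, and node counts are placeholders for your own values (on GCE the predefined flavors go 4, 8, 16, ... vCPUs), and the rest of the section stays as you already have it.

      [cluster/myslurmcluster]
      # ... rest of the section as-is ...
      # instead of many small nodes, e.g.
      #   compute_nodes=20
      #   flavor=n1-standard-4
      # use fewer, larger ones:
      compute_nodes=4
      flavor=n1-standard-16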
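Re #2, the image-creation step on Google Cloud looks roughly like the following with the gcloud CLI. Again only a sketch: the instance names, zone, and image names are placeholders (check the actual node names with `elasticluster list-nodes` or in the Cloud Console), it assumes each boot disk is named after its instance (the GCE default), and [3] remains the authoritative reference.

      # stop the two template nodes first
      gcloud compute instances stop frontend001 compute001 --zone=us-central1-a

      # create one disk image per node class
      gcloud compute images create my-frontend-disk-image \
          --source-disk=frontend001 --source-disk-zone=us-central1-a
      gcloud compute images create my-compute-disk-image \
          --source-disk=compute001 --source-disk-zone=us-central1-a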
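Re #4, this is the kind of experimentation I mean, assuming a Linux control machine and that your setup takes the forks count from the ANSIBLE_FORKS environment variable; if you set `ansible_forks=` in the config file instead, as in Google's guide, adjust that value the same way.

      # start at 4x the local core count ...
      export ANSIBLE_FORKS=$(( 4 * $(nproc) ))
      elasticluster start myslurmcluster
      # ... watch CPU and network load on the control machine while Ansible runs,
      # then lower or raise ANSIBLE_FORKS on the next run accordingly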
Please let me know if this is clear enough, and, above all, if it helps :-)

Ciao,
R

[1]: https://github.com/elasticluster/elasticluster/issues/365
[2]: https://github.com/elasticluster/elasticluster/blob/master/docs/presentations/hepix2017/slides.pdf
[3]: https://cloud.google.com/compute/docs/images/create-delete-deprecate-private-images#create_image
