Dear Riccardo,

Many thanks for your extensive answer!
Regarding your suggestions:

#1. This is indeed a quick and interesting solution. In fact, given my
limited knowledge of parallel computing, I was not aware that I could
achieve the same parallelization with either setup. Just to make sure I
understand, let's suppose I would like to run 20 R sessions in parallel,
where each session uses 4 CPUs. Is it possible to do so even if I have a
cluster made of 4 nodes with 20 cores each?
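
For concreteness, what I have in mind is roughly a SLURM array job like
the sketch below (the script name and the exact numbers are just
placeholders to illustrate the question):

  #!/bin/bash
  #SBATCH --job-name=r-sessions   # placeholder job name
  #SBATCH --array=1-20            # 20 independent R sessions
  #SBATCH --ntasks=1
  #SBATCH --cpus-per-task=4       # each session gets 4 CPUs

  # my_analysis.R is a placeholder; it would use its 4 cores internally,
  # e.g. via parallel::mclapply(..., mc.cores = 4)
  Rscript my_analysis.R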

#2--4. Thanks for these tips, I will try them out and I will let you know
how they work :-)

Thanks again for all your help!
Nicola


On Mon, Oct 5, 2020 at 3:16 PM Riccardo Murri <[email protected]>
wrote:

> Hello Nicola,
>
> thanks for your question!  I realize this is an important issue for
> ElastiCluster (perhaps *the* single most important issue [1]), but I
> have done nothing to document workarounds properly.  I welcome
> suggestions on where/how to mention this in the publicly available
> documentation!
>
> First of all: ElastiCluster depends critically on Ansible, and Ansible
> is slow.  So there is not much that can be done to *radically* speed
> up cluster configuration; it's a design limitation of ElastiCluster.
>
> That said, there are a few mitigations that can be applied:
>
> #1. Use larger nodes
>
> Given the configuration you posted, I presume you're following
> Google's "Running R at scale" guide; that guide sets up a cluster for
> spreading single-threaded R functions across a set of compute cores.
> In this case, you are interested in the total number of *cores* that
> the cluster provides, not so much in their distribution across nodes
> (as would be the case, e.g., if you were running a hybrid MPI/OpenMP
> application).
>
> So here's the trick: use fewer, larger nodes!  Since most of the setup
> work is repeated for every node, configuration time scales roughly with
> the number of nodes: 4 nodes with 20 cores each will be configured ~5x
> faster than 20 nodes with 4 cores each.
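>
> For example, keeping the rest of your configuration as-is, the relevant
> keys would change roughly like this (the machine type below is only an
> example; pick one that matches your core budget and quota):
>
>   [cluster/myslurmcluster]
>   # ... cloud/login/setup keys unchanged ...
>   flavor=n1-standard-16   # example: one big machine type per node
>   frontend_nodes=1
>   compute_nodes=4         # fewer, larger compute nodes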
>
>
> #2. Use snapshots
>
> This will help if you plan on deleting and re-creating clusters with
> the same set of installed software over time; for instance, if you are
> need to spin up R clusters 4+ times over the course of a few
> months, or if you are going to install a large cluster (say, >50
> nodes).  It will *not* help with one-off small cluster setups.
>
> What takes a lot of time is the initial installation and configuration
> of software, which has to be repeated for each node.  The idea here is
> to do this once, snapshot the running cluster, and use it as a base
> for building other clusters.
>
> The procedure requires a bit of manual intervention:
>
> - Start a cluster with the exact configuration you want, but only 1
> frontend node and 1 compute node.
> - Power off both nodes and create disk images from them (instructions
> for Google Cloud are at [3], but all cloud providers offer similar
> functionality; see the example commands after this list).
> - Change your config file to use the snapshotted disk images as
> `image_id` instead of the pristine OS image; note that you will need
> separate images for the frontend and compute nodes.  So your config
> file will
> be something like:
>
>   # ... rest of config file as-is
>
>   [cluster/myslurmcluster/frontend]
>   image_id=my-frontend-disk-image
>   # ... rest of config file as-is
>
>   [cluster/myslurmcluster/compute]
>   image_id=my-compute-disk-image
>   # ... rest of config file as-is
>
> - Change the number of nodes to match the intended usage and start the
> real cluster.
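>
> On Google Cloud, the image-creation step boils down to something like
> the commands below (just a sketch: the instance names and the zone are
> examples; use the names ElastiCluster actually gave your nodes):
>
>   # stop the two nodes
>   gcloud compute instances stop myslurmcluster-frontend001 \
>       myslurmcluster-compute001 --zone=us-central1-a
>
>   # create one image per node type from the boot disks
>   gcloud compute images create my-frontend-disk-image \
>       --source-disk=myslurmcluster-frontend001 --source-disk-zone=us-central1-a
>   gcloud compute images create my-compute-disk-image \
>       --source-disk=myslurmcluster-compute001 --source-disk-zone=us-central1-a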
>
>
> #3. Use `mitogen`
>
> Install the `mitogen` Python package following instructions at
> https://mitogen.networkgenomics.com/ansible_detailed.html
>
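> If you installed ElastiCluster with `pip` in a virtualenv, that
> typically boils down to something like this (just a sketch; the
> virtualenv path is only an example):
>
>   . ~/elasticluster/bin/activate
>   pip install mitogen
>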
> (Note: this won't work if you are using the Dockerized version of
> ElastiCluster aka `elasticluster.sh`)
>
> If `mitogen` is present, ElastiCluster will use it automatically to
> speed up Ansible's SSH connections; the benefits are especially
> evident in conjunction with #2.
>
>
> #4. Adjust ANSIBLE_FORKS
>
> Contrary to what Google's online article states, picking a fixed
> `ansible_forks=` value isn't the best option.  The optimal number
> depends on the number of CPU cores and the network bandwidth *of the
> control machine* (i.e., the one where you're running the
> `elasticluster` command).  It does *not* depend on the size of the
> cluster being built.
>
> It takes a bit of experimentation to find the optimal number; I
> normally start at 4x the number of local CPU cores, keep an eye on the
> CPU and network utilization, and then adjust (down if you see CPU or
> network being saturated, up if they're not).
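>
> Just as an illustration of that rule of thumb, on a control machine
> with 8 cores I would start from 4x8 = 32 forks and tune from there
> (set it via whichever knob you are already using, e.g. the
> `ansible_forks=` config value or Ansible's `ANSIBLE_FORKS`
> environment variable):
>
>   # 8 local cores -> start at 32 forks, watch CPU/network and adjust
>   ANSIBLE_FORKS=32 elasticluster start myslurmcluster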
>
>
> Please let me know if this is clear enough, and, above all, if it helps :-)
>
> Ciao,
> R
>
>
> [1]: https://github.com/elasticluster/elasticluster/issues/365
> [2]:
> https://github.com/elasticluster/elasticluster/blob/master/docs/presentations/hepix2017/slides.pdf
> [3]:
> https://cloud.google.com/compute/docs/images/create-delete-deprecate-private-images#create_image
>
