Hi Orxan, all,
> Elasticluster spent nearly two hours for configuration of a cluster with 37
> nodes.
Yes, this is definitely a pain point with ElastiCluster/Ansible ATM.
I'll try to
summarize the issue and give some suggestions here.
My rule of thumb for the time it takes to set up a basic SLURM cluster
with ElastiCluster is ~20 minutes per 10 nodes; that can quickly
become ~25 minutes per 10 nodes if you are installing add-on software
(e.g., Ganglia) or if you have very high SSH connection latency. I'd
say your experience of 2 hrs for ~40 nodes is in that ballpark.
> Considering that I am going to use a 1000-node cluster this means a
> lot time hence money for just configuration. Is there a way to speed up the
> configuration time?
Yes: give me part of the money to work on scalability features :-)
Seriously though, here's what you can do *now* to cut down setup time
(in decreasing order of effectiveness):
* Start your large cluster from node snapshots:
1. Create a cluster like the one you are about to start, but much
smaller (1 frontend + 1 compute node is enough)
2. Make snapshots of the frontend and the compute node (and any
other node type you are using, e.g., GlusterFS data servers)
3. Modify the large cluster configuration to use these snapshots
instead of the base OS images:
         [cluster/my-large-cluster]
         # ... usual config
         [cluster/my-large-cluster/frontend]
         image_id = id-of-frontend-snapshot
         [cluster/my-large-cluster/compute]
         image_id = id-of-compute-snapshot
This allows Ansible to "fast forward" on many time-consuming tasks
(e.g., installation of packages)
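  The snapshot step could look like this on an OpenStack-based cloud
  (a sketch only: the cluster name "snapshot-template", the node names,
  and the snapshot names are all illustrative, and other clouds have
  equivalent commands):

```shell
# Snapshot the frontend and one compute node of the small template
# cluster; the resulting image IDs go into the large cluster's
# image_id settings shown above.
openstack server image create --name slurm-frontend-snap snapshot-template-frontend001
openstack server image create --name slurm-compute-snap  snapshot-template-compute001

# Look up the image IDs to paste into the ElastiCluster config:
openstack image show -f value -c id slurm-frontend-snap
openstack image show -f value -c id slurm-compute-snap
```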
* Use larger nodes -- setup time scales linearly with the number of
  *nodes*, so you can get a cluster with the same number of cores but
  fewer nodes (hence, quicker to set up) by using larger nodes.
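  For instance, 10 x 16-core nodes instead of 40 x 4-core nodes gives
  the same 160 cores with a quarter of the hosts to configure (flavor
  name and key values below are illustrative; check what your cloud
  offers):

```ini
[cluster/my-large-cluster]
# ... usual config
compute_nodes = 10

[cluster/my-large-cluster/compute]
# one larger flavor replaces four smaller ones
flavor = 16cpu-64g-hpc
```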
* Set the environment variable ANSIBLE_FORKS to a higher value:
  ElastiCluster defaults to ANSIBLE_FORKS=10 but you should be able to
  safely set this to 4x or 6x the number of cores in the VM running
  ElastiCluster. This allows more nodes to be set up at the same time.
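  Concretely, something like this before running the setup (the 4x
  multiplier follows the suggestion above; the cluster name is
  illustrative):

```shell
# Size ANSIBLE_FORKS to ~4x the core count of the machine running
# ElastiCluster, so more nodes are configured in parallel.
CORES=$(nproc)
export ANSIBLE_FORKS=$(( CORES * 4 ))
echo "Using $ANSIBLE_FORKS Ansible forks"
# elasticluster setup my-large-cluster
```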
Lastly, I can make more stuff optional (e.g., the "HPC standard"
stuff) -- there was some discussion on this mailing list quite some
time ago, where people basically suggested that the basic install be
kept as minimal as possible. I have not given this task much priority
up to now, but it can be done relatively quickly. Do you have any
deadlines for your 1000-node cluster?
More details and current plans for overcoming the issue at:
https://github.com/gc3-uzh-ch/elasticluster/issues/365
I'd be glad for any suggestions and a more in-depth discussion.
Ciao,
R