> Start your large cluster from node snapshots

I already use a custom image, but I don't differentiate between frontend and compute nodes, so they both use the same custom image (or snapshot, assuming they are basically the same thing).
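In ElastiCluster config terms that setup looks roughly like this (a sketch only; the cluster and image names are placeholders). A single image_id in the cluster section applies to every node kind unless it is overridden per kind:

    [cluster/my-cluster]
    # ... usual config ...
    image_id = id-of-my-custom-image    # same image for frontend and compute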
> Use larger nodes

Unfortunately, multi-core nodes aren't useful for me because I am testing the scalability of my program, so each node should spend the same amount of time on communication. Intra-node communication would spoil the results, since it is much faster than inter-node communication.

> Do you have any deadlines for your 1000-node cluster?

I am at the end of my PhD work, so I should finish the simulations ASAP. I am happy to hear that you are going to improve configuration time, but even if you cut it from 20 min to 10 min per 10 nodes, for 1000 nodes that still means roughly 17 hours instead of 33, which is not acceptable when the simulation itself takes 1 hour to complete.
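Spelling the arithmetic out (assuming, as noted below, that setup time scales linearly with the number of nodes):

    at 20 min per 10 nodes:  1000 nodes * 2 min/node = 2000 min  ~ 33 hours
    at 10 min per 10 nodes:  1000 nodes * 1 min/node = 1000 min  ~ 17 hours

versus about 1 hour for the simulation run itself.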
I am just pointing out that cloud HPC is not cost-efficient in the development and testing stage, when frequent (parallel) debugging is needed and the cluster cannot be kept running: it has to be shut down immediately after use to save money. Validated codes, however, would benefit a lot from an improvement in configuration time.

On Tue, May 22, 2018 at 12:47 PM, Riccardo Murri <[email protected]> wrote:
> Hi Orxan, all,
>
> > Elasticluster spent nearly two hours for configuration of a cluster
> > with 37 nodes.
>
> Yes, this is definitely a pain point with ElastiCluster/Ansible ATM.
> I'll try to summarize the issue and give some suggestions here.
>
> My rule of thumb for the time it takes to set up a basic SLURM cluster
> with ElastiCluster is ~20 minutes every 10 nodes; that can quickly
> become ~25 per 10 nodes if you are installing add-on software (e.g.,
> Ganglia) or if you have very bad SSH connection latency. I'd say your
> experience of 2 hrs per ~40 nodes is in that ballpark.
>
> > Considering that I am going to use a 1000-node cluster this means a
> > lot of time, hence money, for just configuration. Is there a way to
> > speed up the configuration time?
>
> Yes: give me part of the money to work on scalability features :-)
>
> Srsly, what you can do *now* to cut down setup time (in decreasing
> order of effectiveness):
>
> * Start your large cluster from node snapshots:
>
>   1. Create a cluster like the one you are about to start, but much
>      smaller (1 frontend + 1 compute node is enough)
>   2. Make snapshots of the frontend and the compute node (and any
>      other node type you are using, e.g., GlusterFS data servers)
>   3. Modify the large cluster configuration to use these snapshots
>      instead of the base OS images:
>
>       [cluster/my-large-cluster]
>       # ... usual config
>
>       [cluster/my-large-cluster/frontend]
>       image_id = id-of-frontend-snapshot
>
>       [cluster/my-large-cluster/compute]
>       image_id = id-of-compute-snapshot
>
>   This allows Ansible to "fast forward" through many time-consuming
>   tasks (e.g., installation of packages).
>
> * Use larger nodes -- setup time scales linearly with the number of
>   *nodes*, so you can get a cluster with the same number of cores but
>   fewer nodes (hence quicker to set up) by using larger nodes.
>
> * Set the environment variable ANSIBLE_FORKS to a higher value:
>   ElastiCluster defaults to ANSIBLE_FORKS=10, but you should be able
>   to set this to 4x or 6x the number of cores in your ElastiCluster VM
>   safely. This allows more nodes to be set up at the same time.
>
> Lastly, I can make more stuff optional (e.g., the "HPC standard"
> stuff) -- there was some discussion on this mailing list quite some
> time ago, where people basically suggested that the basic install be
> kept as minimal as possible. I have not given this task much priority
> up to now, but it can be done relatively quickly. Do you have any
> deadlines for your 1000-node cluster?
>
> More details and current plans for overcoming the issue at:
> https://github.com/gc3-uzh-ch/elasticluster/issues/365
>
> I'd be glad for any suggestions and a more in-depth discussion.
>
> Ciao,
> R
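Putting those suggestions together, the snapshot-based workflow would look roughly like the commands below. This is only a sketch: the cluster template names are placeholders, the instance names depend on how your provider names the nodes, and the snapshot commands assume an OpenStack back-end (other providers have their own equivalent snapshot operations).

    # 1. bring up a minimal template cluster (1 frontend + 1 compute node)
    elasticluster start my-small-cluster

    # 2. snapshot both node types on the cloud provider's side, e.g. with the
    #    OpenStack CLI (replace the <...> placeholders with the actual instance names)
    openstack server image create --name frontend-snapshot <frontend-instance>
    openstack server image create --name compute-snapshot  <compute-instance>

    # 3. point the large cluster's image_id entries at those snapshots (as in the
    #    config shown above), raise ANSIBLE_FORKS (e.g. ~4x the cores of the machine
    #    running ElastiCluster), and start the full-size cluster
    ANSIBLE_FORKS=40 elasticluster start my-large-cluster

    # 4. shut everything down as soon as the run finishes, to stop paying for idle nodes
    elasticluster stop my-large-cluster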
