Dear Riccardo,

Thank you very much for your detailed answers. One thing that is still not clear to me is how adding or removing nodes from the cluster affects SLURM (the queue, the state information, etc.) and the jobs that are already running. Is it equivalent to "scontrol reconfigure", where you have to restart slurmctld every time you add or remove a node? Or is another mechanism used? If we want to change the set of nodes frequently, how does this affect the users?
Thank you,
Ana

On Tuesday, 17 January 2017 21:12:43 UTC+1, Riccardo Murri wrote:
> Dear Ana,
>
> (Ana Jokanović, Tue, Jan 17, 2017 at 06:34:44AM -0800:)
> > I understand that ElastiCluster installs SLURM on the newly created
> > cluster.
>
> Actually, ElastiCluster just runs Ansible on the cluster; depending on
> the "groups" that your config file defines for a node, that node will
> get software installed and configured on it. That is to say, you could
> install SLURM+GlusterFS or GridEngine+Hadoop+Ganglia if you found that
> combination useful.
>
> > Is SLURM source code within ElastiCluster source code? Where can I
> > find it?
>
> No, ElastiCluster installs SLURM from pre-compiled packages. These are the
> SLURM packages that come from the distribution's main archive (Debian,
> Ubuntu), or the packages from the SLURM COPR [1] by @verdurin on
> CentOS/RHEL.
>
> [1]: https://copr.fedorainfracloud.org/coprs/verdurin/slurm/
>
> In a former release, ElastiCluster downloaded the SLURM sources
> from the SchedMD website and compiled them. This made configuring a
> cluster much slower for basically no gain, so it was dropped in favor of
> installing from precompiled packages.
>
> > May I substitute it with another (modified) version of SLURM?
>
> Yes, as long as you package it and know how to edit Ansible playbooks.
>
> > Also, can I edit slurm.conf and where can I find it?
>
> Here:
> https://github.com/gc3-uzh-ch/elasticluster/blob/master/elasticluster/share/playbooks/roles/slurm-common/templates/slurm.conf.j2
>
> If you do not want to mess with ElastiCluster sources, you can make your
> own Ansible playbook that deploys your own customized `slurm.conf` and
> then runs `scontrol reconfigure`.
> Assuming you named this playbook
> `after.yml`, you can run the following command to have the custom
> playbook run after ElastiCluster's main config::
>
>     elasticluster setup mycluster -- after.yml
>
> If you copy the `after.yml` playbook into the ElastiCluster sources,
> directory `elasticluster/share/playbooks/`, then it will automatically be
> executed.
>
> Note that, since SLURM likes to embed the list of nodes and partitions
> in the `slurm.conf` file, you *have to* make the new `slurm.conf` a
> template: ElastiCluster has to plug the node names into it. A good
> idea could be to start with the `.j2` file provided above and modify it.
>
> > Which part of the ElastiCluster is responsible for resizing of the
> > cluster?
>
> It's the commands `elasticluster resize` and `elasticluster
> remove-node`.
>
> Note that -for the time being- ElastiCluster's resize operations have to
> be initiated by an admin; no action is ever triggered automatically.
>
> > In SLURM's documentation I have found out about the Elastic computing and
> > possibility to resize the cluster through setting ResumeProgram and
> > SuspendProgram in slurm.conf
> > (https://slurm.schedmd.com/elastic_computing.html). Is this how
> > ElastiCluster interacts with SLURM, as well?
>
> No, it's quite different.
>
> SLURM requires you to specify a set of nodes in `slurm.conf`, and then
> you provide `ResumeProgram` and `SuspendProgram` scripts which create
> these nodes as VMs in an IaaS cloud. The decision of when to start or
> stop a node is left to the SLURM scheduler, and the process is fully
> automatic. The cluster will not grow beyond the limits set in
> `slurm.conf`.
>
> As ElastiCluster deals with many different software systems, not just
> SLURM, it takes a completely different approach: you can add or remove
> cluster nodes at any time. Every resize operation, however, triggers a
> re-run of the Ansible playbooks to reconfigure the cluster to the new
> setup.
> Depending on the installed software, this may lead to a downtime
> in operations (should not happen with SLURM, but I'm not so sure about
> e.g. GlusterFS). Also, resize operations must be initiated by an admin
> and are never triggered automatically.
>
> Does this answer your questions?
>
> Ciao,
> R
>
> --
> Riccardo Murri, Schwerzenbacherstrasse 2, CH-8606 Nänikon, Switzerland
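
For reference, the custom `after.yml` playbook described in the quoted reply could look roughly like the sketch below. It is only an illustration under several assumptions, not something taken from the ElastiCluster sources: the host group names `slurm_master` and `slurm_worker`, the template file name `my-slurm.conf.j2`, and the destination path `/etc/slurm/slurm.conf` are all guesses that need to be checked against the actual cluster configuration and distribution (Debian/Ubuntu packages may use a different directory).

```yaml
---
# after.yml: hypothetical sketch of a playbook run after ElastiCluster's
# main configuration. Group names, file names and paths are assumptions.

- name: Deploy a customized slurm.conf on all SLURM nodes
  hosts: slurm_master:slurm_worker      # assumed group names
  become: yes
  tasks:
    - name: Render slurm.conf from a modified copy of the stock template
      template:
        src: my-slurm.conf.j2           # hypothetical; start from slurm.conf.j2
        dest: /etc/slurm/slurm.conf     # assumed path; varies by distribution
        mode: "0644"

- name: Ask the SLURM controller to re-read its configuration
  hosts: slurm_master                   # assumed group name
  become: yes
  tasks:
    - name: Run scontrol reconfigure
      command: scontrol reconfigure
```

It would then be run as in the quoted reply, with `elasticluster setup mycluster -- after.yml`. Keeping `slurm.conf` as a Jinja2 template matters here, because the node names have to be filled in again after every resize.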
