Branch: refs/heads/master
Home:   https://github.com/gc3-uzh-ch/elasticluster

Commit: 5fc5a991f1634337819d8f67edc858ab08c4eec4
    https://github.com/gc3-uzh-ch/elasticluster/commit/5fc5a991f1634337819d8f67edc858ab08c4eec4
Author: Riccardo Murri <riccardo.mu...@gmail.com>
Date:   2018-01-18 (Thu, 18 Jan 2018)
Changed paths:
  M docs/configure.rst
  M elasticluster/conf.py
  M elasticluster/providers/gce.py
  M elasticluster/validate.py
  A examples/slurm-with-gpu-on-google.conf
  M tests/test_conf.py

Log Message:
-----------
Add support for GPUs on Google Cloud

Many thanks to @benpass for providing the initial implementation in PR #406!


Commit: 32567cb120697af45c3ac8900dd765bf2f830580
    https://github.com/gc3-uzh-ch/elasticluster/commit/32567cb120697af45c3ac8900dd765bf2f830580
Author: Riccardo Murri <riccardo.mu...@gmail.com>
Date:   2018-01-18 (Thu, 18 Jan 2018)

Changed paths:
  M elasticluster/share/playbooks/roles/slurm-common/templates/slurm.conf.j2

Log Message:
-----------
slurm-common: Cosmetic changes to `slurm.conf`


Commit: b27240f7ae0152f06ce0276ca6df1d10eb267f6b
    https://github.com/gc3-uzh-ch/elasticluster/commit/b27240f7ae0152f06ce0276ca6df1d10eb267f6b
Author: Riccardo Murri <riccardo.mu...@gmail.com>
Date:   2018-01-18 (Thu, 18 Jan 2018)

Changed paths:
  M docs/playbooks.rst
  A elasticluster/share/playbooks/roles/slurm-worker/files/etc/slurm/cgroup/release_agent
  A elasticluster/share/playbooks/roles/slurm-worker/files/etc/slurm/cgroup_allowed_devices_file.conf
  A elasticluster/share/playbooks/roles/slurm-worker/files/usr/local/sbin/elasticluster-check-kconfig-cgroups.sh
  A elasticluster/share/playbooks/roles/slurm-worker/tasks/cgroup.yml
  M elasticluster/share/playbooks/roles/slurm-worker/tasks/main.yml
  A elasticluster/share/playbooks/roles/slurm-worker/templates/cgroup.conf.j2
  A elasticluster/share/playbooks/roles/slurm-worker/vars/main.yml

Log Message:
-----------
SLURM: Support use of cgroups (opt-in)

Configure cgroup support in SLURM if any one of the cgroup-based
plugins (`task/cgroup`, `jobacct_gather/cgroup`, or `proctrack/cgroup`)
is configured in `slurm.conf`.

*Note:* SLURM's cgroup support requires that swap accounting be enabled
in the kernel. This is *not* the default on Debian and Ubuntu, and a
reboot is needed to enable it.
ElastiCluster will by default try to configure the bootloader but *not*
reboot the nodes -- use `global_var_allow_reboot=yes` to change this
default and reboot the nodes if needed.


Commit: 50a997209ba1a3a9c2c77b95ca5da8226cc6b404
    https://github.com/gc3-uzh-ch/elasticluster/commit/50a997209ba1a3a9c2c77b95ca5da8226cc6b404
Author: Riccardo Murri <riccardo.mu...@gmail.com>
Date:   2018-01-18 (Thu, 18 Jan 2018)

Changed paths:
  A elasticluster/share/playbooks/library/gpus
  M elasticluster/share/playbooks/roles/slurm-common/tasks/main.yml
  M elasticluster/share/playbooks/roles/slurm-common/templates/slurm.conf.j2
  M elasticluster/share/playbooks/roles/slurm-worker/tasks/main.yml
  A elasticluster/share/playbooks/roles/slurm-worker/templates/gres.conf.j2
  M elasticluster/share/playbooks/site.yml

Log Message:
-----------
Configure SLURM's GRES with GPUs (if available)


Commit: 0c6f0490830f3a404e8f1561bbceab7d76f9f7d7
    https://github.com/gc3-uzh-ch/elasticluster/commit/0c6f0490830f3a404e8f1561bbceab7d76f9f7d7
Author: Riccardo Murri <riccardo.mu...@gmail.com>
Date:   2018-01-18 (Thu, 18 Jan 2018)

Changed paths:
  M elasticluster/cluster.py

Log Message:
-----------
Better message for "instance running" check

Let's say the instance is "up" instead of "up and running", because the
latter suggests that we can connect and use it at any time, and that is
typically not true for instances that are just starting (boot times can
still be a few minutes).
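The opt-in cgroup support from the commit above is driven by the setup
variables, so a sketch of how one might enable it could look like the
following (the section and cluster names here are made up for
illustration; `slurm_taskplugin` follows the `slurm_` naming rule
documented later in this digest, and `global_var_allow_reboot=yes` is
the option the commit message mentions for permitting the reboot that
swap accounting may require):

```ini
# Hypothetical ElastiCluster configuration fragment (names illustrative).
# Setting any cgroup-based plugin triggers SLURM cgroup configuration.
[setup/slurm-cgroups]
provider=ansible
slurm_taskplugin=task/cgroup

[cluster/mycluster]
setup=slurm-cgroups
# Allow ElastiCluster to reboot nodes if the kernel needs swap
# accounting enabled (not the default on Debian/Ubuntu).
global_var_allow_reboot=yes
```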
Commit: 2b8dd3e9c0f126c2d8c97df4984099c1c7abc5fe
    https://github.com/gc3-uzh-ch/elasticluster/commit/2b8dd3e9c0f126c2d8c97df4984099c1c7abc5fe
Author: Hatef Monajemi <monaj...@stanford.edu>
Date:   2018-01-18 (Thu, 18 Jan 2018)

Changed paths:
  A elasticluster/share/playbooks/roles/cuda/tasks/init-Debian.yml
  A elasticluster/share/playbooks/roles/cuda/tasks/init-RedHat.yml
  A elasticluster/share/playbooks/roles/cuda/tasks/main.yml
  M elasticluster/share/playbooks/site.yml

Log Message:
-----------
New role `cuda` to automatically install CUDA if GPUs are detected


Commit: 486a9869222826f1e7fa37a6838ee2823287ff66
    https://github.com/gc3-uzh-ch/elasticluster/commit/486a9869222826f1e7fa37a6838ee2823287ff66
Author: Riccardo Murri <riccardo.mu...@gmail.com>
Date:   2018-01-18 (Thu, 18 Jan 2018)

Changed paths:
  A elasticluster/share/playbooks/library/bootparam.py

Log Message:
-----------
New Ansible module `bootparam.py` to alter the Linux boot command-line


Commit: 4a1ebf210f2da4ec8ccd88ddedce9d114c37e461
    https://github.com/gc3-uzh-ch/elasticluster/commit/4a1ebf210f2da4ec8ccd88ddedce9d114c37e461
Author: Riccardo Murri <riccardo.mu...@gmail.com>
Date:   2018-01-18 (Thu, 18 Jan 2018)

Changed paths:
  M docs/playbooks.rst
  M elasticluster/share/playbooks/roles/slurm-common/defaults/main.yml
  M elasticluster/share/playbooks/roles/slurm-common/templates/slurm.conf.j2

Log Message:
-----------
SLURM: Allow configuring more parameters in `slurm.conf` through setup
variables.
Specifically, it is now possible to set variables in the `[setup/*]`
section to assign values to the following SLURM configuration
parameters:

* `FastSchedule` (default 1)
* `JobAcctGatherFrequency` (default 60)
* `JobAcctGatherType` (default `jobacct_gather/linux`)
* `MaxArraySize` (default 1000)
* `MaxJobCount` (default 10000)
* `ProcTrackType` (default `proctrack/linuxproc`)
* `ReturnToService` (default 1)
* `SelectType` (default `select/cons_res`)
* `SelectTypeParameters` (default `CR_Core_Memory`)
* `TaskPlugin` (default `task/none`)

The ElastiCluster setup variable name corresponding to a SLURM
parameter name is the lowercased name prefixed with `slurm_`. For
instance, SLURM parameter `FastSchedule` can be configured via the
variable `slurm_fastschedule`. (Note that SLURM parameter names are
not case-sensitive, but ElastiCluster variable names are.)

Default values have not changed from previous ElastiCluster releases.


Commit: 2aa414e7ac9038c71e090edfce4ea1d538420cd9
    https://github.com/gc3-uzh-ch/elasticluster/commit/2aa414e7ac9038c71e090edfce4ea1d538420cd9
Author: Riccardo Murri <riccardo.mu...@gmail.com>
Date:   2018-01-18 (Thu, 18 Jan 2018)

Changed paths:
  M elasticluster/share/playbooks/roles/bigtop/tasks/main.yml

Log Message:
-----------
Bigtop: only update the APT cache if the repo was added in this run


Commit: 05474b82c2b718390077301acbe290e1860705f7
    https://github.com/gc3-uzh-ch/elasticluster/commit/05474b82c2b718390077301acbe290e1860705f7
Author: Riccardo Murri <riccardo.mu...@gmail.com>
Date:   2018-01-18 (Thu, 18 Jan 2018)

Changed paths:
  M docs/playbooks.rst
  M elasticluster/share/playbooks/library/gpus
  A elasticluster/share/playbooks/roles/cuda.yml
  A elasticluster/share/playbooks/roles/cuda/defaults/main.yml
  A elasticluster/share/playbooks/roles/cuda/tasks/_check_nvidia_dev.yml
  A elasticluster/share/playbooks/roles/cuda/tasks/_reboot_and_wait.yml
  M elasticluster/share/playbooks/roles/cuda/tasks/init-Debian.yml
  M elasticluster/share/playbooks/roles/cuda/tasks/init-RedHat.yml
  M elasticluster/share/playbooks/roles/cuda/tasks/main.yml
  A elasticluster/share/playbooks/roles/cuda/templates/etc/profile.d/cuda.csh.j2
  A elasticluster/share/playbooks/roles/cuda/templates/etc/profile.d/cuda.sh.j2
  A elasticluster/share/playbooks/roles/cuda/templates/etc/yum.repos.d/cuda.repo.j2
  A elasticluster/share/playbooks/roles/cuda/vars/main.yml
  M elasticluster/share/playbooks/roles/slurm-worker/tasks/cgroup.yml
  M elasticluster/share/playbooks/site.yml

Log Message:
-----------
CUDA: Many role improvements

In particular:

* support the role on CentOS/RHEL as well
* allow setting the CUDA version via a setup variable
* ensure CUDA binaries are found in the PATH of login shells
* document the new role


Commit: add3000ecee7aa8ac6654d1143279c52c78dfea6
    https://github.com/gc3-uzh-ch/elasticluster/commit/add3000ecee7aa8ac6654d1143279c52c78dfea6
Author: Riccardo Murri <riccardo.mu...@gmail.com>
Date:   2018-01-18 (Thu, 18 Jan 2018)

Changed paths:
  M elasticluster/share/playbooks/roles/slurm-common/templates/slurm.conf.j2

Log Message:
-----------
SLURM: allow setting `DefMemPerCPU` through setup variables
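As a sketch of how the `slurm_*` setup variables described earlier in
this digest might be used (the section name and all values are
illustrative, not taken from the source; `slurm_defmempercpu` is
inferred from the `DefMemPerCPU` commit above via the documented
lowercase-with-`slurm_`-prefix naming rule):

```ini
# Hypothetical [setup/*] fragment: each slurm_* variable maps to the
# SLURM configuration parameter of the same (case-insensitive) name.
[setup/slurm]
provider=ansible
slurm_fastschedule=1
slurm_maxarraysize=1000
slurm_maxjobcount=10000
slurm_selecttype=select/cons_res
slurm_selecttypeparameters=CR_Core_Memory
slurm_defmempercpu=2000
```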
Commit: 3df1a756e5a7ed4fc97868f64faa4e4f45f614f4
    https://github.com/gc3-uzh-ch/elasticluster/commit/3df1a756e5a7ed4fc97868f64faa4e4f45f614f4
Author: Riccardo Murri <riccardo.mu...@gmail.com>
Date:   2018-01-18 (Thu, 18 Jan 2018)

Changed paths:
  M elasticluster/share/playbooks/roles/slurm-worker/tasks/main.yml

Log Message:
-----------
SLURM: Fix YAML syntax error in task "Install SLURM worker packages"


Commit: 87de4cd6bfbc04aef75219cdba47e4992c06166a
    https://github.com/gc3-uzh-ch/elasticluster/commit/87de4cd6bfbc04aef75219cdba47e4992c06166a
Author: Riccardo Murri <riccardo.mu...@gmail.com>
Date:   2018-01-18 (Thu, 18 Jan 2018)

Changed paths:
  M elasticluster/share/playbooks/roles/common/tasks/init-RedHat.yml

Log Message:
-----------
CentOS/RHEL: Upgrade all installed packages to the latest version

This is necessary in order to get the correct kernel and headers, in
case we need to compile additional device drivers (e.g., for CUDA).


Commit: a9b2e1e37b5a59ec68de97aba5a0e5c7a9331244
    https://github.com/gc3-uzh-ch/elasticluster/commit/a9b2e1e37b5a59ec68de97aba5a0e5c7a9331244
Author: Riccardo Murri <riccardo.mu...@gmail.com>
Date:   2018-01-18 (Thu, 18 Jan 2018)

Changed paths:
  A examples/slurm-with-gpu-on-aws.conf
  M examples/slurm-with-gpu-on-google.conf

Log Message:
-----------
Update GPU-accelerated cluster examples


Commit: 9c3b3dfee4c328c9837475bdee2647e430299287
    https://github.com/gc3-uzh-ch/elasticluster/commit/9c3b3dfee4c328c9837475bdee2647e430299287
Author: Riccardo Murri <riccardo.mu...@gmail.com>
Date:   2018-01-18 (Thu, 18 Jan 2018)

Changed paths:
  M docs/playbooks.rst

Log Message:
-----------
Convert SLURM variables table to list-table format

Makes it way easier to edit descriptions...
Commit: 66a6bad78e8bd3ec02301ca9a3f58d7266f6fb59
    https://github.com/gc3-uzh-ch/elasticluster/commit/66a6bad78e8bd3ec02301ca9a3f58d7266f6fb59
Author: Riccardo Murri <riccardo.mu...@gmail.com>
Date:   2018-01-18 (Thu, 18 Jan 2018)

Changed paths:
  M docs/playbooks.rst
  M elasticluster/share/playbooks/roles/slurm-common/templates/slurm.conf.j2
  M elasticluster/share/playbooks/roles/slurm-worker/templates/cgroup.conf.j2

Log Message:
-----------
SLURM: Use `slurm_allowedramspace` and `slurm_allowedswapspace` to
compute the total `VSizeFactor`.


Commit: cf1d1f04f6ec3810686ad1463ed31ef9a56be8e1
    https://github.com/gc3-uzh-ch/elasticluster/commit/cf1d1f04f6ec3810686ad1463ed31ef9a56be8e1
Author: Riccardo Murri <riccardo.mu...@gmail.com>
Date:   2018-01-18 (Thu, 18 Jan 2018)

Changed paths:
  M docs/playbooks.rst
  M elasticluster/share/playbooks/roles/slurm-common/defaults/main.yml
  M elasticluster/share/playbooks/roles/slurm-common/templates/slurm.conf.j2

Log Message:
-----------
SLURM: Default for `ReturnToService` is now `2`

If CUDA or the `task/cgroup` plugin is used, it is possible that a
reboot happens during the installation. When the nodes come back up,
SLURM will mark them as "down", since the reboot was unexpected, and
wait for a sysadmin to issue `sudo scontrol update nodename=...
state=resume`. With `ReturnToService=2`, nodes where `slurmd` is
running will automatically be returned to the "idle" state, which is
what most people (likely) want.


Commit: 692da1b032e5cea0aa98be14f344a94574e71605
    https://github.com/gc3-uzh-ch/elasticluster/commit/692da1b032e5cea0aa98be14f344a94574e71605
Author: Riccardo Murri <riccardo.mu...@gmail.com>
Date:   2018-01-18 (Thu, 18 Jan 2018)

Changed paths:
  M elasticluster/share/playbooks/roles/cuda/tasks/main.yml

Log Message:
-----------
Temporary workaround for incompatibility between the newest Ubuntu
kernel and the `nvidia-387` driver.
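As a sketch of the `VSizeFactor` relation mentioned in the commit
above (the numbers are illustrative, not from the source, and the
exact arithmetic lives in the Jinja templates the commit touches;
summing the two percentages is an assumption):

```ini
# cgroup.conf -- per-job memory limits, as percentages of allocated RAM
# (rendered from cgroup.conf.j2 using the slurm_allowed*space variables):
AllowedRAMSpace=100
AllowedSwapSpace=20

# slurm.conf -- total virtual-memory cap (rendered from slurm.conf.j2),
# presumably computed as AllowedRAMSpace + AllowedSwapSpace:
VSizeFactor=120
```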
Commit: 11c74ff90c0c6e566721269f61d0f7f7731d3e3c
    https://github.com/gc3-uzh-ch/elasticluster/commit/11c74ff90c0c6e566721269f61d0f7f7731d3e3c
Author: Riccardo Murri <riccardo.mu...@gmail.com>
Date:   2018-01-18 (Thu, 18 Jan 2018)

Changed paths:
  M elasticluster/share/playbooks/roles/cuda/tasks/_reboot_and_wait.yml

Log Message:
-----------
CUDA: Fix installation on Ubuntu 14.04

Installation on Ubuntu 14.04 *does* indeed require a reboot, plus a
long time spent compiling the driver for two different kernel versions.


Compare: https://github.com/gc3-uzh-ch/elasticluster/compare/3b14ca56a167...11c74ff90c0c

--
You received this message because you are subscribed to the Google
Groups "elasticluster-dev" group.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticluster-dev/5a6116c8965de_53f92abd3deb1c147583c%40hookshot-fe-cace476.cp1-iad.github.net.mail.