A short update with some background: we have been able to get the CI back to a stable state, thanks to Pedro and Kellen! The cause of this issue was a required security update related to the Spectre vulnerability (https://bugs.launchpad.net/ubuntu/+source/nvidia-graphics-drivers-384/+bug/1741807). This update was not compatible with the installed nvidia-docker version and therefore broke our CI. I have installed all updates, validated that nvidia-docker is working again, and started a new set of mxnet-linux-gpu-slaves. If any issues arise, please don't hesitate to drop a quick message on this thread.
-Marco

On Wed, Jan 10, 2018 at 6:45 PM, Marco de Abreu <[email protected]> wrote:

> Hello,
>
> recently, Nvidia released a new version of their CUDA and GPU drivers for
> Ubuntu 16.04. This update was applied automatically while the slaves
> were running, which caused the nvidia-docker-daemon to disconnect. Because
> the update requires a restart, the daemon was not able to reconnect and
> reported the error 'nvml: Driver/library version mismatch'. We have restarted
> all slaves to apply the update.
>
> In future, we plan to explicitly disallow automated updates of all
> nvidia-related drivers.
>
> Best regards,
> Marco
