From the log you attached, it seems that you're using the Mesos containerizer, so a `docker pull` won't affect Mesos. Can you verify whether the error also occurs with the latest nvidia/cuda image?
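One way to confirm the mismatch from a task's stderr is to pull out the two version lines that TensorFlow's CUDA diagnostics print and compare them. The sketch below assumes the stderr text is available as an unwrapped Python string (not the re-wrapped quoted copy in this mail). The function names `driver_versions` and `versions_match` are made up for illustration, and the patterns are based only on the log lines quoted in this thread.

```python
import re

# Patterns modeled on the cuda_diagnostics lines quoted in this thread
# (assumption: other TensorFlow versions may word these lines differently).
LIBCUDA_RE = re.compile(r"libcuda reported\s+version is:\s*(\d+(?:\.\d+)*)")
KERNEL_RE = re.compile(r"kernel reported\s+version is:\s*(\d+(?:\.\d+)*)")


def driver_versions(log_text):
    """Return (libcuda_version, kernel_version); a missing one is None."""
    libcuda = LIBCUDA_RE.search(log_text)
    kernel = KERNEL_RE.search(log_text)
    return (
        libcuda.group(1) if libcuda else None,
        kernel.group(1) if kernel else None,
    )


def versions_match(log_text):
    """True only when both versions were found and are identical."""
    libcuda, kernel = driver_versions(log_text)
    return libcuda is not None and libcuda == kernel


# Two lines in the shape reported in the thread, used as sample input.
sample = (
    "cuda_diagnostics.cc:192] libcuda reported version is: 410.48.0\n"
    "cuda_diagnostics.cc:196] kernel reported version is: 418.56.0\n"
)
print(driver_versions(sample))
print(versions_match(sample))
```

If the two values differ, the user-space libcuda visible inside the container does not come from the same driver release as the kernel module on the host, which is the situation described in the quoted log.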
On Wed, May 1, 2019, 4:25 PM Chun-Hung Hsiao <chhs...@mesosphere.io> wrote:

> Hi Jorge,
>
> Can you provide the output of `docker run --rm -ti nvidia/cuda ls
> /usr/local/cuda-10.1/compat/`?
> It seems that the nvidia kernel driver installed on your host has version
> 418, but the image you're using is version 410.
> The latest `nvidia/cuda` image uses version 418 as well.
> Can you also do a `docker pull nvidia/cuda` then try again with Mesos 1.8?
>
> On Fri, Apr 26, 2019 at 1:03 PM Jorge Machado <jom...@me.com.invalid>
> wrote:
>
>> Hi all,
>>
>> has anyone tested it on Ubuntu 18.04 + nvidia-docker2? We are having
>> some issues using the cuda 10+ images when doing real processing. We still
>> need to check some things but basically we get:
>> kernel version 418.56.0 does not match DSO version 410.48.0 -- cannot
>> find working devices in this configuration
>>
>> Logs:
>> I0424 13:27:14.000586 30 executor.cpp:726] Forked command at 73
>> Preparing rootfs at
>> '/data0/mesos/work/provisioner/containers/548d3cae-30b5-4530-a8db-f94b00215718/backends/overlay/rootfses/e1ceb89e-3abc-4587-a87c-d63037b7ae8b'
>> Marked '/' as rslave
>> Executing pre-exec command
>> '{"arguments":["ln","-s","/sys/fs/cgroup/cpu,cpuacct","/data0/mesos/work/provisioner/containers/548d3cae-30b5-4530-a8db-f94b00215718/backends/overlay/rootfses/e1ceb89e-3abc-4587-a87c-d63037b7ae8b/sys/fs/cgroup/cpuacct"],"shell":false,"value":"ln"}'
>> Executing pre-exec command
>> '{"arguments":["ln","-s","/sys/fs/cgroup/cpu,cpuacct","/data0/mesos/work/provisioner/containers/548d3cae-30b5-4530-a8db-f94b00215718/backends/overlay/rootfses/e1ceb89e-3abc-4587-a87c-d63037b7ae8b/sys/fs/cgroup/cpu"],"shell":false,"value":"ln"}'
>> Changing root to
>> /data0/mesos/work/provisioner/containers/548d3cae-30b5-4530-a8db-f94b00215718/backends/overlay/rootfses/e1ceb89e-3abc-4587-a87c-d63037b7ae8b
>> 2019-04-24 13:27:18.346994: I
>> tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports
>> instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
>> 2019-04-24 13:27:18.352203: E
>> tensorflow/stream_executor/cuda/cuda_driver.cc:300] failed call to cuInit:
>> CUDA_ERROR_UNKNOWN: unknown error
>> 2019-04-24 13:27:18.352243: I
>> tensorflow/stream_executor/cuda/cuda_diagnostics.cc:161] retrieving CUDA
>> diagnostic information for host: __host__
>> 2019-04-24 13:27:18.352252: I
>> tensorflow/stream_executor/cuda/cuda_diagnostics.cc:168] hostname: __host__
>> 2019-04-24 13:27:18.352295: I
>> tensorflow/stream_executor/cuda/cuda_diagnostics.cc:192] libcuda reported
>> version is: 410.48.0
>> 2019-04-24 13:27:18.352329: I
>> tensorflow/stream_executor/cuda/cuda_diagnostics.cc:196] kernel reported
>> version is: 418.56.0
>> 2019-04-24 13:27:18.352338: E
>> tensorflow/stream_executor/cuda/cuda_diagnostics.cc:306] kernel version
>> 418.56.0 does not match DSO version 410.48.0 -- cannot find working devices
>> in this configuration
>> 2019-04-24 13:27:18.374940: I
>> tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency:
>> 2593920000 Hz
>> 2019-04-24 13:27:18.378793: I
>> tensorflow/compiler/xla/service/service.cc:150] XLA service 0x4f41e10
>> executing computations on platform Host. Devices:
>> 2019-04-24 13:27:18.378821: I
>> tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device
>> (0): <undefined>, <undefined>
>> W0424 13:27:18.385210 140191267731200 deprecation.py:323] From
>> /usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py:263:
>> colocate_with (from tensorflow.python.framework.ops) is deprecated and will
>> be removed in a future version.
>> Instructions for updating:
>> Colocations handled automatically by placer.
>> W0424 13:27:18.399287 140191267731200 deprecation.py:323] From
>> /user/tf-benchmarks-113/scripts/tf_cnn_benchmarks/convnet_builder.py:129:
>> conv2d (from tensorflow.python.layers.convolutional) is deprecated and will
>> be removed in a future version.
>> Instructions for updating:
>> Use keras.layers.conv2d instead.
>> W0424 13:27:18.433226 140191267731200 deprecation.py:323] From
>> /user/tf-benchmarks-113/scripts/tf_cnn_benchmarks/convnet_builder.py:261:
>> max_pooling2d (from tensorflow.python.layers.pooling) is deprecated and
>> will be removed in a future version.
>> Instructions for updating:
>> Use keras.layers.max_pooling2d instead.
>> W0424 13:27:20.197937 140191267731200 deprecation.py:323] From
>> /usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/losses/losses_impl.py:209:
>> to_float (from tensorflow.python.ops.math_ops) is deprecated and will be
>> removed in a future version.
>> Instructions for updating:
>> Use tf.cast instead.
>> W0424 13:27:20.312573 140191267731200 deprecation.py:323] From
>> /usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_ops.py:3066:
>> to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be
>> removed in a future version.
>> Instructions for updating:
>> Use tf.cast instead.
>> W0424 13:27:21.082763 140191267731200 deprecation.py:323] From
>> /user/tf-benchmarks-113/scripts/tf_cnn_benchmarks/benchmark_cnn.py:2238:
>> __init__ (from tensorflow.python.training.supervisor) is deprecated and
>> will be removed in a future version.
>> Instructions for updating:
>> Please switch to tf.train.MonitoredTrainingSession
>> I0424 13:27:22.013817 140191267731200 session_manager.py:491] Running
>> local_init_op.
>> I0424 13:27:22.193911 140191267731200 session_manager.py:493] Done
>> running local_init_op.
>> 2019-04-24 13:27:23.181740: E
>> tensorflow/core/common_runtime/executor.cc:624] Executor failed to create
>> kernel. Invalid argument: Default MaxPoolingOp only supports NHWC on device
>> type CPU
>> [[{{node tower_0/v/cg/mpool0/MaxPool}}]]
>> I0424 13:27:23.262847 140191267731200 coordinator.py:224] Error reported
>> to Coordinator: <class
>> 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>, Default
>> MaxPoolingOp only supports NHWC on device type CPU
>> [[node tower_0/v/cg/mpool0/MaxPool (defined at
>> /user/tf-benchmarks-113/scripts/tf_cnn_benchmarks/convnet_builder.py:261) ]
>> running this on nvidia-docker2 works fine.
>> image used: tensorflow/tensorflow:latest-gpu
>> command: python
>> /user/tf-benchmarks-113/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
>> --num_gpus=1 --batch_size=32 --model=resnet50
>> --variable_update=parameter_server
>> on the host nvidia-smi says: NVIDIA-SMI 418.56 Driver Version:
>> 418.56 CUDA Version: 10.1
>> thx
>> Jorge
>> > On 26 Apr 2019, at 18:28, Benno Evers <bev...@mesosphere.com> wrote:
>> >
>> > Hi all,
>> >
>> > Please vote on releasing the following candidate as Apache Mesos 1.8.0.
>> >
>> >
>> > 1.8.0 includes the following:
>> >
>> --------------------------------------------------------------------------------
>> > * Greatly reduced allocator cycle time.
>> > * Operation feedback for v1 schedulers.
>> > * Per-framework minimum allocatable resources.
>> > * New CLI subcommands `task attach` and `task exec`.
>> > * New `linux/seccomp` isolator.
>> > * Support for Docker v2 Schema2 manifest format.
>> > * XFS quota for persistent volumes.
>> > * **Experimental** Support for the new CSI v1 API.
>> >
>> > In addition, 1.8.0-rc2 includes the following changes:
>> >
>> ---------------------------------------------------------------------------------
>> > * Docker manifest v2s2 config with image GC.
>> > * Expanded `highlights` section in the CHANGELOG.
>> >
>> > In addition, 1.8.0-rc3 includes the following changes:
>> >
>> ---------------------------------------------------------------------------------
>> > * Relaxed protobuf union validation strictness. (MESOS-9740)
>> > * Fixed a bug causing non-uniform random results in the random sorter.
>> > (MESOS-9733)
>> >
>> >
>> > The CHANGELOG for the release is available at:
>> >
>> https://gitbox.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.8.0-rc3
>> >
>> --------------------------------------------------------------------------------
>> >
>> > The candidate for Mesos 1.8.0 release is available at:
>> >
>> https://dist.apache.org/repos/dist/dev/mesos/1.8.0-rc3/mesos-1.8.0.tar.gz
>> >
>> > The tag to be voted on is 1.8.0-rc3:
>> > https://gitbox.apache.org/repos/asf?p=mesos.git;a=commit;h=1.8.0-rc3
>> >
>> > The SHA512 checksum of the tarball can be found at:
>> >
>> https://dist.apache.org/repos/dist/dev/mesos/1.8.0-rc3/mesos-1.8.0.tar.gz.sha512
>> >
>> > The signature of the tarball can be found at:
>> >
>> https://dist.apache.org/repos/dist/dev/mesos/1.8.0-rc3/mesos-1.8.0.tar.gz.asc
>> >
>> > The PGP key used to sign the release is here:
>> > https://dist.apache.org/repos/dist/release/mesos/KEYS
>> >
>> > The JAR is in a staging repository here:
>> > https://repository.apache.org/content/repositories/orgapachemesos-1253
>> >
>> > Please vote on releasing this package as Apache Mesos 1.8.0!
>> >
>> > The vote is open until and passes if a majority of at least 3 +1 PMC
>> votes
>> > are cast.
>> >
>> > [ ] +1 Release this package as Apache Mesos 1.8.0
>> > [ ] -1 Do not release this package because ...
>> >
>> > Thanks,
>> > Benno and Joseph
>>
>>