Hi Jorge,

I'm admittedly not too familiar with CUDA and TensorFlow, but the error message you describe sounds to me more like a build issue: it looks like the version of the NVIDIA driver differs between the Docker image and the host system.
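One way to confirm the suspected mismatch is to pull the two version strings out of the TensorFlow diagnostics and compare them. The following is only a minimal sketch: the function names `find_driver_versions` and `versions_match` are hypothetical, and it assumes the log contains the "libcuda reported version is" and "kernel reported version is" lines seen in the quoted output. It inspects log text only and does not query the driver itself.

```python
import re
from typing import Optional, Tuple

# A dotted version number such as 410.48.0.
_VERSION = r"(\d+(?:\.\d+)*)"


def _phrase(words: str) -> str:
    """Build a pattern that tolerates line wraps and '>' quote markers between words."""
    return r"[\s>]+".join(re.escape(word) for word in words.split())


_LIBCUDA_RE = re.compile(_phrase("libcuda reported version is:") + r"[\s>]*" + _VERSION)
_KERNEL_RE = re.compile(_phrase("kernel reported version is:") + r"[\s>]*" + _VERSION)


def find_driver_versions(log: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (libcuda_version, kernel_version); either is None if not found."""
    libcuda = _LIBCUDA_RE.search(log)
    kernel = _KERNEL_RE.search(log)
    return (
        libcuda.group(1) if libcuda else None,
        kernel.group(1) if kernel else None,
    )


def versions_match(log: str) -> Optional[bool]:
    """True/False if both versions were found, None if the log lacks them."""
    libcuda, kernel = find_driver_versions(log)
    if libcuda is None or kernel is None:
        return None
    return libcuda == kernel


if __name__ == "__main__":
    # Excerpt of the diagnostics from the quoted log, including the mail wrapping.
    excerpt = (
        "> cuda_diagnostics.cc:192] libcuda reported\n"
        "> version is: 410.48.0\n"
        "> cuda_diagnostics.cc:196] kernel reported\n"
        "> version is: 418.56.0\n"
    )
    print(find_driver_versions(excerpt))  # ('410.48.0', '418.56.0')
    print(versions_match(excerpt))        # False
```

A `False` result here would only confirm that the user-space library and the kernel module disagree; it would not say which side (image or host) supplied the stale library.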

Maybe you could continue investigating to see whether this is related to the release itself or has some external cause, and create a JIRA ticket to capture your findings?

Thanks,
Benno

On Fri, Apr 26, 2019 at 9:55 PM Jorge Machado <jom...@me.com> wrote:

> Hi all,
>
> did someone tested it on ubuntu 18.04 + nvidia-docker2 ? We are having
> some issues using the cuda 10+ images when doing real processing. We still
> need to check some things but basically we get:
>
> kernel version 418.56.0 does not match DSO version 410.48.0 -- cannot find
> working devices in this configuration
>
>
> Logs:
>
> I0424 13:27:14.000586 30 executor.cpp:726] Forked command at 73
> Preparing rootfs at
> '/data0/mesos/work/provisioner/containers/548d3cae-30b5-4530-a8db-f94b00215718/backends/overlay/rootfses/e1ceb89e-3abc-4587-a87c-d63037b7ae8b'
> Marked '/' as rslave
> Executing pre-exec command
> '{"arguments":["ln","-s","/sys/fs/cgroup/cpu,cpuacct","/data0/mesos/work/provisioner/containers/548d3cae-30b5-4530-a8db-f94b00215718/backends/overlay/rootfses/e1ceb89e-3abc-4587-a87c-d63037b7ae8b/sys/fs/cgroup/cpuacct"],"shell":false,"value":"ln"}'
> Executing pre-exec command
> '{"arguments":["ln","-s","/sys/fs/cgroup/cpu,cpuacct","/data0/mesos/work/provisioner/containers/548d3cae-30b5-4530-a8db-f94b00215718/backends/overlay/rootfses/e1ceb89e-3abc-4587-a87c-d63037b7ae8b/sys/fs/cgroup/cpu"],"shell":false,"value":"ln"}'
> Changing root to
> /data0/mesos/work/provisioner/containers/548d3cae-30b5-4530-a8db-f94b00215718/backends/overlay/rootfses/e1ceb89e-3abc-4587-a87c-d63037b7ae8b
> 2019-04-24 13:27:18.346994: I
> tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports
> instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
> 2019-04-24 13:27:18.352203: E
> tensorflow/stream_executor/cuda/cuda_driver.cc:300] failed call to cuInit:
> CUDA_ERROR_UNKNOWN: unknown error
> 2019-04-24 13:27:18.352243: I
> tensorflow/stream_executor/cuda/cuda_diagnostics.cc:161] retrieving CUDA
> diagnostic information for host: __host__
> 2019-04-24 13:27:18.352252: I
> tensorflow/stream_executor/cuda/cuda_diagnostics.cc:168] hostname: __host__
> 2019-04-24 13:27:18.352295: I
> tensorflow/stream_executor/cuda/cuda_diagnostics.cc:192] libcuda reported
> version is: 410.48.0
> 2019-04-24 13:27:18.352329: I
> tensorflow/stream_executor/cuda/cuda_diagnostics.cc:196] kernel reported
> version is: 418.56.0
> 2019-04-24 13:27:18.352338: E
> tensorflow/stream_executor/cuda/cuda_diagnostics.cc:306] kernel version 418.56.0 does not match DSO
> version 410.48.0 -- cannot find working devices in this configuration
> 2019-04-24 13:27:18.374940: I
> tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency:
> 2593920000 Hz
> 2019-04-24 13:27:18.378793: I tensorflow/compiler/xla/service/service.cc:150]
> XLA service 0x4f41e10 executing computations on platform Host. Devices:
> 2019-04-24 13:27:18.378821: I tensorflow/compiler/xla/service/service.cc:158]
> StreamExecutor device (0): <undefined>, <undefined>
> W0424 13:27:18.385210 140191267731200 deprecation.py:323] From
> /usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py:263:
> colocate_with (from tensorflow.python.framework.ops) is deprecated and will
> be removed in a future version.
> Instructions for updating:
> Colocations handled automatically by placer.
> W0424 13:27:18.399287 140191267731200 deprecation.py:323] From
> /user/tf-benchmarks-113/scripts/tf_cnn_benchmarks/convnet_builder.py:129:
> conv2d (from tensorflow.python.layers.convolutional) is deprecated and will
> be removed in a future version.
> Instructions for updating:
> Use keras.layers.conv2d instead.
> W0424 13:27:18.433226 140191267731200 deprecation.py:323] From
> /user/tf-benchmarks-113/scripts/tf_cnn_benchmarks/convnet_builder.py:261:
> max_pooling2d (from tensorflow.python.layers.pooling) is deprecated and will
> be removed in a future version.
> Instructions for updating:
> Use keras.layers.max_pooling2d instead.
> W0424 13:27:20.197937 140191267731200 deprecation.py:323] From
> /usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/losses/losses_impl.py:209:
> to_float (from tensorflow.python.ops.math_ops) is deprecated and will be
> removed in a future version.
> Instructions for updating:
> Use tf.cast instead.
> W0424 13:27:20.312573 140191267731200 deprecation.py:323] From
> /usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_ops.py:3066:
> to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be
> removed in a future version.
> Instructions for updating:
> Use tf.cast instead.
> W0424 13:27:21.082763 140191267731200 deprecation.py:323] From
> /user/tf-benchmarks-113/scripts/tf_cnn_benchmarks/benchmark_cnn.py:2238:
> __init__ (from tensorflow.python.training.supervisor) is deprecated and will
> be removed in a future version.
> Instructions for updating:
> Please switch to tf.train.MonitoredTrainingSession
> I0424 13:27:22.013817 140191267731200 session_manager.py:491] Running
> local_init_op.
> I0424 13:27:22.193911 140191267731200 session_manager.py:493] Done running
> local_init_op.
> 2019-04-24 13:27:23.181740: E tensorflow/core/common_runtime/executor.cc:624]
> Executor failed to create kernel. Invalid argument: Default MaxPoolingOp only
> supports NHWC on device type CPU
> [[{{node tower_0/v/cg/mpool0/MaxPool}}]]
> I0424 13:27:23.262847 140191267731200 coordinator.py:224] Error reported to
> Coordinator: <class
> 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>, Default
> MaxPoolingOp only supports NHWC on device type CPU
> [[node tower_0/v/cg/mpool0/MaxPool (defined at
> /user/tf-benchmarks-113/scripts/tf_cnn_benchmarks/convnet_builder.py:261) ]
>
> running this on nvidia-docker2 works fine.
>
> image used: tensorflow/tensorflow:latest-gpu
>
> command: python
> /user/tf-benchmarks-113/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
> --num_gpus=1 --batch_size=32 --model=resnet50
> --variable_update=parameter_server
>
> on the host nvidia-smi says: NVIDIA-SMI 418.56 Driver Version: 418.56
> CUDA Version: 10.1
>
> thx
>
> Jorge
>
> On 26 Apr 2019, at 18:28, Benno Evers <bev...@mesosphere.com> wrote:
>
> Hi all,
>
> Please vote on releasing the following candidate as Apache Mesos 1.8.0.
>
>
> 1.8.0 includes the following:
>
> --------------------------------------------------------------------------------
> * Greatly reduced allocator cycle time.
> * Operation feedback for v1 schedulers.
> * Per-framework minimum allocatable resources.
> * New CLI subcommands `task attach` and `task exec`.
> * New `linux/seccomp` isolator.
> * Support for Docker v2 Schema2 manifest format.
> * XFS quota for persistent volumes.
> * **Experimental** Support for the new CSI v1 API.
>
> In addition, 1.8.0-rc2 includes the following changes:
>
> ---------------------------------------------------------------------------------
> * Docker manifest v2s2 config with image GC.
> * Expanded `highlights` section in the CHANGELOG.
>
> In addition, 1.8.0-rc3 includes the following changes:
>
> ---------------------------------------------------------------------------------
> * Relaxed protobuf union validation strictness. (MESOS-9740)
> * Fixed a bug causing non-uniform random results in the random sorter.
> (MESOS-9733)
>
>
> The CHANGELOG for the release is available at:
>
> https://gitbox.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.8.0-rc3
>
> --------------------------------------------------------------------------------
>
> The candidate for Mesos 1.8.0 release is available at:
> https://dist.apache.org/repos/dist/dev/mesos/1.8.0-rc3/mesos-1.8.0.tar.gz
>
> The tag to be voted on is 1.8.0-rc3:
> https://gitbox.apache.org/repos/asf?p=mesos.git;a=commit;h=1.8.0-rc3
>
> The SHA512 checksum of the tarball can be found at:
>
> https://dist.apache.org/repos/dist/dev/mesos/1.8.0-rc3/mesos-1.8.0.tar.gz.sha512
>
> The signature of the tarball can be found at:
>
> https://dist.apache.org/repos/dist/dev/mesos/1.8.0-rc3/mesos-1.8.0.tar.gz.asc
>
> The PGP key used to sign the release is here:
> https://dist.apache.org/repos/dist/release/mesos/KEYS
>
> The JAR is in a staging repository here:
> https://repository.apache.org/content/repositories/orgapachemesos-1253
>
> Please vote on releasing this package as Apache Mesos 1.8.0!
>
> The vote is open until and passes if a majority of at least 3 +1 PMC votes
> are cast.
>
> [ ] +1 Release this package as Apache Mesos 1.8.0
> [ ] -1 Do not release this package because ...
>
> Thanks,
> Benno and Joseph
>
>

--
Benno Evers
Software Engineer, Mesosphere