cybaj opened a new issue #18960: URL: https://github.com/apache/incubator-mxnet/issues/18960
## Description

### TL;DR
I want to build an image that contains a library with an `mxnet` dependency, so I added installation of that library and of `mxnet` to my Dockerfile. The `mxnet` package installed fine, but the build failed with `OSError: libcuda.so.1: cannot open shared object file: No such file or directory` during installation of the library. So I added `LD_LIBRARY_PATH` as well. But in that case, unlike before, no GPUs were detected.

### cuda
I used the `nvcr.io/nvidia/pytorch:19.10-py3` image, which contains the following ([ref](https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel_19-10.html#rel_19-10)):
- NVIDIA CUDA 10.1.243 including cuBLAS 10.2.1.243
- NVIDIA cuDNN 7.6.4

So I installed `mxnet-cu101` from PyPI. I have also checked that `libcuda.so.1` exists in `/usr/local/cuda/compat/lib.real`.

### Error Message
1. Cannot install a Python package that has an `mxnet` dependency; the log is below.
```
Step 5/9 : RUN pip install git+https://github.com/cybaj/KoGPT2.git#egg=kogpt2
 ---> Running in 2ede86c70b10
Collecting kogpt2 from git+https://github.com/cybaj/KoGPT2.git#egg=kogpt2
  Cloning https://github.com/cybaj/KoGPT2.git to /tmp/pip-install-kawcrvv2/kogpt2
  Running command git clone -q https://github.com/cybaj/KoGPT2.git /tmp/pip-install-kawcrvv2/kogpt2
    ERROR: Command errored out with exit status 1:
     command: /opt/conda/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-kawcrvv2/kogpt2/setup.py'"'"'; __file__='"'"'/tmp/pip-install-kawcrvv2/kogpt2/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base pip-egg-info
         cwd: /tmp/pip-install-kawcrvv2/kogpt2/
    Complete output (21 lines):
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-install-kawcrvv2/kogpt2/setup.py", line 1, in <module>
        from kogpt2 import __version__
      File "/tmp/pip-install-kawcrvv2/kogpt2/kogpt2/__init__.py", line 15, in <module>
        from . import model
      File "/tmp/pip-install-kawcrvv2/kogpt2/kogpt2/model/__init__.py", line 17, in <module>
        from .gpt import *
      File "/tmp/pip-install-kawcrvv2/kogpt2/kogpt2/model/gpt.py", line 24, in <module>
        import mxnet as mx
      File "/opt/conda/lib/python3.6/site-packages/mxnet/__init__.py", line 24, in <module>
        from .context import Context, current_context, cpu, gpu, cpu_pinned
      File "/opt/conda/lib/python3.6/site-packages/mxnet/context.py", line 24, in <module>
        from .base import classproperty, with_metaclass, _MXClassPropertyMetaClass
      File "/opt/conda/lib/python3.6/site-packages/mxnet/base.py", line 214, in <module>
        _LIB = _load_lib()
      File "/opt/conda/lib/python3.6/site-packages/mxnet/base.py", line 205, in _load_lib
        lib = ctypes.CDLL(lib_path[0], ctypes.RTLD_LOCAL)
      File "/opt/conda/lib/python3.6/ctypes/__init__.py", line 348, in __init__
        self._handle = _dlopen(self._name, mode)
    OSError: libcuda.so.1: cannot open shared object file: No such file or directory
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
The command '/bin/sh -c pip install git+https://github.com/cybaj/KoGPT2.git#egg=kogpt2' returned a non-zero code: 1
```
2. If I add `LD_LIBRARY_PATH` as below (a path that contains `libcuda.so.1`), the installation succeeds, but NO GPU is detected.
```
Step 9/10 : RUN python -c "import torch; print(torch.__version__); print(torch.cuda.device_count());"
 ---> Running in 8273db124d7f
1.3.0a0+24ae9b5
0
Removing intermediate container 8273db124d7f
 ---> a8f922092018
Step 10/10 : RUN python -c "import mxnet; print(mxnet.__version__); print(mxnet.util.get_gpu_count());"
 ---> Running in 59c987a815bc
1.6.0
0
```
Without installing the Python library that needs the `mxnet` dependency, all GPUs are detected.

## To Reproduce
Docker build with the Dockerfile below.
```
FROM nvcr.io/nvidia/pytorch:19.10-py3
ENV LD_LIBRARY_PATH $LD_LIBRARY_PATH:/usr/local/cuda/compat/lib.real
RUN pip install --no-cache-dir mxnet_cu101
RUN pip install --no-cache-dir gluonnlp sentencepiece
RUN pip install git+https://github.com/cybaj/KoGPT2.git#egg=kogpt2 # this needs mxnet
RUN pip install transformers==2.11.0
WORKDIR /workspace
RUN python -c "import torch; print(torch.__version__); print(torch.cuda.device_count());"
RUN python -c "import mxnet; print(mxnet.__version__); print(mxnet.util.get_gpu_count());"
```

### Steps to reproduce
1. Docker build with the Dockerfile above.
2. Check GPU detection:
```
RUN python -c "import torch; print(torch.__version__); print(torch.cuda.device_count());"
RUN python -c "import mxnet; print(mxnet.__version__); print(mxnet.util.get_gpu_count());"
```

## What have you tried to solve it?
1. Used another Docker base image (the 10.12 version), but it failed.
2. Built on another machine (with the latest Docker version), but it also failed.

## Environment
We recommend using our script for collecting the diagnostic information. Run the following command and paste the outputs below:
```
curl --retry 10 -s https://raw.githubusercontent.com/dmlc/gluon-nlp/master/tools/diagnose.py | python
# paste outputs here
```
That diagnose.py URL returns 404 Not Found.
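For quick diagnosis inside the image, the failing import above ultimately reduces to a `ctypes.CDLL` call in `mxnet/base.py`. A minimal sketch that probes the same failure mode without importing all of `mxnet` (the helper name `can_dlopen` is made up for illustration; it assumes a Linux dynamic linker):

```python
import ctypes

def can_dlopen(libname):
    """Return True if the dynamic linker can resolve `libname`.

    Mimics the `ctypes.CDLL(lib_path[0], ctypes.RTLD_LOCAL)` call in
    mxnet/base.py that raised the OSError, so it can be run in a RUN
    step to check whether libcuda.so.1 is visible to dlopen.
    """
    try:
        ctypes.CDLL(libname, ctypes.RTLD_LOCAL)
        return True
    except OSError:
        return False

# Expected to print False in the failing build stage until
# LD_LIBRARY_PATH (or ldconfig) makes the compat libcuda visible.
print(can_dlopen("libcuda.so.1"))
```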
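To separate "libcuda.so.1 is loadable" from "the runtime actually exposes GPUs", a check that does not go through mxnet or torch at all can help; a hedged sketch that counts devices by parsing `nvidia-smi -L` (the helper name `gpu_count_via_nvidia_smi` is an assumption, not an mxnet API; the subprocess flags are chosen for Python 3.6 compatibility, matching the container's interpreter):

```python
import subprocess

def gpu_count_via_nvidia_smi():
    """Count GPUs by parsing `nvidia-smi -L`, which prints one
    'GPU <n>: <name>' line per visible device.

    Returns 0 when nvidia-smi is missing or fails (e.g. no driver
    inside a docker-build container) instead of raising, so its result
    can be compared directly against mxnet.util.get_gpu_count().
    """
    try:
        proc = subprocess.run(
            ["nvidia-smi", "-L"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return 0
    return sum(1 for line in proc.stdout.splitlines()
               if line.startswith("GPU "))

print(gpu_count_via_nvidia_smi())
```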
