roywei commented on issue #14026: [Nightly Test Failure] Tutorial test_tutorials.test_gluon_end_to_end Test Failure
URL: https://github.com/apache/incubator-mxnet/issues/14026#issuecomment-462016640

@marcoabreu @Chancebair could you reopen this issue?

**On the test failure**

I think the root cause is that shared memory in the Docker container is too small, which makes the Gluon DataLoader hang when it uses multiple workers; see issue https://github.com/apache/incubator-mxnet/issues/11872. That would explain why this test passes locally and fails only in Docker.

**On the Docker reproduction failure**

I'm still not able to build the dependencies following [step 2.2 here](https://cwiki.apache.org/confluence/display/MXNET/Reproducing+test+results); I get the same error after changing regions. Without the dependencies built, running the test fails with an MXNet library not found error.

Building the dependencies on a g3.8xlarge with CUDA 9.1, cuDNN 7, and nvidia-docker2, using the following command:

```
ci/build.py --docker-registry mxnetci --nvidiadocker -p ubuntu_build_cuda /work/runtime_functions.sh build_ubuntu_gpu_cuda91_cudnn7
```

gives the following error:

```
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/docker/api/client.py", line 256, in _raise_for_status
    response.raise_for_status()
  File "/usr/local/lib/python3.5/dist-packages/requests/models.py", line 940, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: http+docker://localhost/v1.35/containers/bf3bde2e3acca965512b321d1fd0df150cbf904a7dc0dfa04c207758d980e56a/start

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "ci/build.py", line 582, in <module>
    sys.exit(main())
  File "ci/build.py", line 506, in main
    local_ccache_dir=args.ccache_dir, cleanup=cleanup, environment=environment)
  File "ci/build.py", line 307, in container_run
    environment=environment)
  File "/usr/local/lib/python3.5/dist-packages/docker/models/containers.py", line 791, in run
    container.start()
  File "/usr/local/lib/python3.5/dist-packages/docker/models/containers.py", line 392, in start
    return self.client.api.start(self.id, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/docker/utils/decorators.py", line 19, in wrapped
    return f(self, resource_id, *args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/docker/api/container.py", line 1091, in start
    self._raise_for_status(res)
  File "/usr/local/lib/python3.5/dist-packages/docker/api/client.py", line 258, in _raise_for_status
    raise create_api_error_from_http_exception(e)
  File "/usr/local/lib/python3.5/dist-packages/docker/errors.py", line 31, in create_api_error_from_http_exception
    raise cls(e, response=response, explanation=explanation)
docker.errors.APIError: 500 Server Error: Internal Server Error ("OCI runtime create failed: container_linux.go:344: starting container process caused "process_linux.go:424: container init caused \"process_linux.go:407: running prestart hook 1 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig.real --device=all --compute --utility --require=cuda>=9.1 --pid=21215 /var/lib/docker/overlay2/514b187cccda325ae75d5a526a1b060aa2d301708c6fc7c712529289ab2179ca/merged]\\\\nnvidia-container-cli: initialization error: driver error: failed to process request\\\\n\\\"\"": unknown")
```

Running the test without the dependencies built:

```
ci/build.py --docker-registry mxnetci --nvidiadocker --platform ubuntu_nightly_gpu --docker-build-retries 3 --shm-size 500m /work/runtime_functions.sh nightly_tutorial_test_ubuntu_python2_gpu
```

gives an MXNet not found error.
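
To help confirm or rule out the shared-memory theory from the first part of this comment, here is a minimal sketch of a check that could be run inside the container before the tutorial test. It is not part of MXNet or the CI scripts: the helper names `shm_free_bytes` and `choose_num_workers` are hypothetical, and the 500 MiB threshold is an assumption that simply mirrors the `--shm-size 500m` flag in the test command, not a measured requirement.

```python
import os
import shutil

# Assumption: 500 MiB mirrors the --shm-size 500m flag from the test command;
# the amount the DataLoader really needs has not been measured here.
MIN_FREE_SHM_BYTES = 500 * 1024 ** 2


def shm_free_bytes(path="/dev/shm"):
    """Return the free bytes on the shared-memory mount, or None if it is absent."""
    if not os.path.isdir(path):
        return None
    return shutil.disk_usage(path).free


def choose_num_workers(requested, free_bytes, min_free_bytes=MIN_FREE_SHM_BYTES):
    """Hypothetical helper: use 0 workers when shared memory looks too small."""
    if free_bytes is None:
        # No shared-memory mount found, so there is nothing to compare against.
        return requested
    return requested if free_bytes >= min_free_bytes else 0


if __name__ == "__main__":
    free = shm_free_bytes()
    print("free /dev/shm bytes:", free)
    print("num_workers to use:", choose_num_workers(4, free))
```

The returned value could be passed as `num_workers` to `mxnet.gluon.data.DataLoader`; with `num_workers=0` the data is loaded in the main process, so worker shared memory is not involved. If the test passes with 0 workers but hangs with several in the same container, that would support the shared-memory explanation.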
