roywei commented on issue #14026: [Nightly Test Failure] Tutorial test_tutorials.test_gluon_end_to_end Test Failure
URL: 
https://github.com/apache/incubator-mxnet/issues/14026#issuecomment-462016640
 
 
   @marcoabreu @Chancebair  could you reopen this issue?
   
   **On the test failure**
   I think the root cause is that the shared memory available to the Docker container is too small, causing the Gluon DataLoader to hang when using multiple workers, as described in https://github.com/apache/incubator-mxnet/issues/11872.
   
   That's why this test passes locally and fails only on Docker.
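   If the shared-memory hypothesis is right, a quick way to confirm it inside the container is to check the size of the `/dev/shm` mount that the DataLoader workers rely on. A minimal sketch (the helper name and the printed message are mine, not part of the CI scripts; Docker's default `/dev/shm` is 64 MB):

```python
import os

def shm_total_bytes(path="/dev/shm"):
    """Total size in bytes of the tmpfs mount backing shared memory."""
    st = os.statvfs(path)
    return st.f_frsize * st.f_blocks

if __name__ == "__main__" and os.path.exists("/dev/shm"):
    # Docker's default /dev/shm is only 64 MB, which is easily
    # exhausted by a multi-worker DataLoader passing batches around.
    size_mb = shm_total_bytes() / (1024 * 1024)
    print("shm size: %.0f MB" % size_mb)
```

   Running this inside the failing container versus on a local machine should show the difference directly.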
   
   
   
   **On the Docker reproduction failure**
   I'm still not able to build the dependencies following [step 2.2 here](https://cwiki.apache.org/confluence/display/MXNET/Reproducing+test+results); I still hit the same error after changing regions.
   Without the dependencies built, running the test fails with an "mxnet lib not found" error.
   
   Building the dependencies on a g3.8xlarge instance with CUDA 9.1, cuDNN 7, and nvidia-docker2, using the following command:
   ```
   ci/build.py --docker-registry mxnetci --nvidiadocker -p ubuntu_build_cuda /work/runtime_functions.sh build_ubuntu_gpu_cuda91_cudnn7
   ```
   gives the following error:
   ```
    Traceback (most recent call last):
      File "/usr/local/lib/python3.5/dist-packages/docker/api/client.py", line 256, in _raise_for_status
        response.raise_for_status()
      File "/usr/local/lib/python3.5/dist-packages/requests/models.py", line 940, in raise_for_status
        raise HTTPError(http_error_msg, response=self)
    requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: http+docker://localhost/v1.35/containers/bf3bde2e3acca965512b321d1fd0df150cbf904a7dc0dfa04c207758d980e56a/start
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "ci/build.py", line 582, in <module>
        sys.exit(main())
      File "ci/build.py", line 506, in main
        local_ccache_dir=args.ccache_dir, cleanup=cleanup, environment=environment)
      File "ci/build.py", line 307, in container_run
        environment=environment)
      File "/usr/local/lib/python3.5/dist-packages/docker/models/containers.py", line 791, in run
        container.start()
      File "/usr/local/lib/python3.5/dist-packages/docker/models/containers.py", line 392, in start
        return self.client.api.start(self.id, **kwargs)
      File "/usr/local/lib/python3.5/dist-packages/docker/utils/decorators.py", line 19, in wrapped
        return f(self, resource_id, *args, **kwargs)
      File "/usr/local/lib/python3.5/dist-packages/docker/api/container.py", line 1091, in start
        self._raise_for_status(res)
      File "/usr/local/lib/python3.5/dist-packages/docker/api/client.py", line 258, in _raise_for_status
        raise create_api_error_from_http_exception(e)
      File "/usr/local/lib/python3.5/dist-packages/docker/errors.py", line 31, in create_api_error_from_http_exception
        raise cls(e, response=response, explanation=explanation)
    docker.errors.APIError: 500 Server Error: Internal Server Error ("OCI runtime create failed: container_linux.go:344: starting container process caused "process_linux.go:424: container init caused \"process_linux.go:407: running prestart hook 1 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig.real --device=all --compute --utility --require=cuda>=9.1 --pid=21215 /var/lib/docker/overlay2/514b187cccda325ae75d5a526a1b060aa2d301708c6fc7c712529289ab2179ca/merged]\\\\nnvidia-container-cli: initialization error: driver error: failed to process request\\\\n\\\"\"": unknown")
   ```
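   The `nvidia-container-cli ... driver error: failed to process request` message points at the host's NVIDIA driver rather than the image, so before rebuilding it may be worth checking on the host whether the driver responds at all. A small sketch (only `nvidia-smi` is a real command here; the helper and the messages are my own):

```python
import shutil
import subprocess

def command_responds(cmd="nvidia-smi"):
    """Return True if `cmd` is on PATH and exits with status 0."""
    if shutil.which(cmd) is None:
        return False
    # stdout/stderr captured via PIPE to stay compatible with Python 3.5,
    # which is what the traceback above shows the CI host running.
    result = subprocess.run([cmd], stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    return result.returncode == 0

if __name__ == "__main__":
    if command_responds():
        print("driver responds; the failure is likely on the container side")
    else:
        print("nvidia-smi missing or failing; fix the host driver first")
```

   If `nvidia-smi` itself fails on the host, no `--nvidiadocker` run can succeed, and the driver (or a kernel/driver version mismatch) needs fixing first.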
   
   
   Running the test without the dependencies built:
   ```
   ci/build.py --docker-registry mxnetci --nvidiadocker --platform ubuntu_nightly_gpu --docker-build-retries 3 --shm-size 500m /work/runtime_functions.sh nightly_tutorial_test_ubuntu_python2_gpu
   ```
   gives an "mxnet not found" error.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
