eric-haibin-lin opened a new issue #17332: Race condition in downloading model 
from model zoo in parallel 
URL: https://github.com/apache/incubator-mxnet/issues/17332
 
 
   When I use Horovod for training and call
   
   `model = get_model(model_name, pretrained=True)`
   
   it fails with
   
   ```
   Exception in thread Thread-5:
   Traceback (most recent call last):
     File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
       self.run()
     File "/usr/lib/python3.6/threading.py", line 864, in run
       self._target(*self._args, **self._kwargs)
     File "tests/python/unittest/test_gluon_model_zoo.py", line 32, in fn
       model = get_model(model_name, pretrained=True, root='parallel_model/')
     File "/home/ubuntu/src/mxnet/python/mxnet/gluon/model_zoo/vision/__init__.py", line 152, in get_model
       return models[name](**kwargs)
     File "/home/ubuntu/src/mxnet/python/mxnet/gluon/model_zoo/vision/mobilenet.py", line 375, in mobilenet_v2_0_25
       return get_mobilenet_v2(0.25, **kwargs)
     File "/home/ubuntu/src/mxnet/python/mxnet/gluon/model_zoo/vision/mobilenet.py", line 250, in get_mobilenet_v2
       get_model_file('mobilenetv2_%s' % version_suffix, root=root), ctx=ctx)
     File "/home/ubuntu/src/mxnet/python/mxnet/gluon/model_zoo/model_store.py", line 115, in get_model_file
       os.remove(zip_file_path)
   FileNotFoundError: [Errno 2] No such file or directory: 'parallel_model/mobilenetv2_0.25-ae8f9392.zip'
   ```
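   The `get_model` API breaks when multiple processes (or threads) call it 
at the same time with the same `root`. The traceback shows 
`os.remove(zip_file_path)` failing in `get_model_file`, presumably because 
another worker had already deleted the shared zip file.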
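
One possible way to avoid this kind of race is sketched below. It is an illustration only, not MXNet's implementation or its actual fix: `fetch_model_file` and the `fetch` callback are hypothetical names, and `fetch` stands in for whatever downloads the zip archive. Each worker downloads and extracts inside its own private temporary directory under `root`, then publishes the result with an atomic rename, so no worker ever deletes a file that another worker expects to exist.

```python
import os
import tempfile
import zipfile


def fetch_model_file(root, file_name, fetch):
    """Hypothetical helper: return the path of `file_name` under `root`.

    `fetch(zip_path)` is an assumed callback that writes a zip archive
    containing `file_name` to `zip_path`.
    """
    final_path = os.path.join(root, file_name)
    if os.path.exists(final_path):
        return final_path  # another worker already published the file

    os.makedirs(root, exist_ok=True)
    # A private scratch directory per worker: nothing in it is shared,
    # so no other worker can remove the zip while this one is using it.
    with tempfile.TemporaryDirectory(dir=root) as tmp_dir:
        zip_path = os.path.join(tmp_dir, file_name + '.zip')
        fetch(zip_path)
        with zipfile.ZipFile(zip_path) as zf:
            zf.extract(file_name, tmp_dir)
        # os.replace overwrites an existing destination; on POSIX the
        # rename is atomic because tmp_dir is on the same filesystem.
        os.replace(os.path.join(tmp_dir, file_name), final_path)
    # Leaving the `with` block removes the scratch directory and the zip.
    return final_path
```

With this layout several workers may still download the same archive redundantly, but the last rename simply overwrites an identical file. A file lock around the download would avoid the duplicate work at the cost of extra coordination.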
