[GitHub] vishaalkapoor commented on issue #14026: [Nightly Test Failure] Tutorial test_tutorials.test_gluon_end_to_end Test Failure

GitBox Fri, 08 Feb 2019 14:47:36 -0800

vishaalkapoor commented on issue #14026: [Nightly Test Failure] Tutorial 
test_tutorials.test_gluon_end_to_end Test Failure
URL: 
https://github.com/apache/incubator-mxnet/issues/14026#issuecomment-461973064
 
 
   There's a connection issue in the logs. It may be related to running inside 
a Docker image and being sandboxed in some way.
   
   As per 
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/NightlyTestsForBinaries/detail/master/221/pipeline
   
   ```
   ERROR:root:An error occurred while executing the following cell:
   ------------------
   def test(net, val_data, ctx):
       metric = mx.metric.Accuracy()
       for i, (data, label) in enumerate(val_data):
           data = gluon.utils.split_and_load(data, ctx_list=ctx, even_split=False)
           label = gluon.utils.split_and_load(label, ctx_list=ctx, even_split=False)
           outputs = [net(x) for x in data]
           metric.update(label, outputs)
       return metric.get()

   trainer = gluon.Trainer(finetune_net.collect_params(), optimizer=sgd_optimizer)

   # start with epoch 1 for easier learning rate calculation
   for epoch in range(1, epochs + 1):

       tic = time.time()
       train_loss = 0
       metric.reset()

       for i, (data, label) in enumerate(train_data):
           # get the images and labels
           data = gluon.utils.split_and_load(data, ctx_list=ctx, even_split=False)
           label = gluon.utils.split_and_load(label, ctx_list=ctx, even_split=False)
           with autograd.record():
               outputs = [finetune_net(x) for x in data]
               loss = [softmax_cross_entropy(yhat, y) for yhat, y in zip(outputs, label)]
           for l in loss:
               l.backward()

           trainer.step(batch_size)
           train_loss += sum([l.mean().asscalar() for l in loss]) / len(loss)
           metric.update(label, outputs)

       _, train_acc = metric.get()
       train_loss /= num_batch
       _, val_acc = test(finetune_net, val_data, ctx)

       print('[Epoch %d] Train-acc: %.3f, loss: %.3f | Val-acc: %.3f | learning-rate: %.3E | time: %.1f' %
             (epoch, train_acc, train_loss, val_acc, trainer.learning_rate, time.time() - tic))

   _, test_acc = test(finetune_net, test_data, ctx)
   print('[Finished] Test-acc: %.3f' % (test_acc))
   ------------------
   ```
   
   ```
   ---------------------------------------------------------------------------
   ConnectionRefusedError                    Traceback (most recent call last)
   <ipython-input-6-cfd10a99e63e> in <module>
        17     metric.reset()
        18
   ---> 19     for i, (data, label) in enumerate(train_data):
        20         # get the images and labels
        21         data = gluon.utils.split_and_load(data, ctx_list=ctx, even_split=False)

   /work/mxnet/python/mxnet/gluon/data/dataloader.py in __next__(self)
       441         assert self._rcvd_idx in self._data_buffer, "fatal error with _push_next, rcvd_idx missing"
       442         ret = self._data_buffer.pop(self._rcvd_idx)
   --> 443         batch = pickle.loads(ret.get()) if self._dataset is None else ret.get()
       444         if self._pin_memory:
       445             batch = _as_in_context(batch, context.cpu_pinned())

   /work/mxnet/python/mxnet/gluon/data/dataloader.py in rebuild_ndarray(pid, fd, shape, dtype)
        55             fd = multiprocessing.reduction.rebuild_handle(fd)
        56         else:
   ---> 57             fd = fd.detach()
        58         return nd.NDArray(nd.ndarray._new_from_shared_mem(pid, fd, shape, dtype))
        59

   /usr/lib/python3.5/multiprocessing/resource_sharer.py in detach(self)
        55         def detach(self):
        56             '''Get the fd.  This should only be called once.'''
   ---> 57             with _resource_sharer.get_connection(self._id) as conn:
        58                 return reduction.recv_handle(conn)
        59

   /usr/lib/python3.5/multiprocessing/resource_sharer.py in get_connection(ident)
        85         from .connection import Client
        86         address, key = ident
   ---> 87         c = Client(address, authkey=process.current_process().authkey)
        88         c.send((key, os.getpid()))
        89         return c

   /usr/lib/python3.5/multiprocessing/connection.py in Client(address, family, authkey)
       485         c = PipeClient(address)
       486     else:
   --> 487         c = SocketClient(address)
       488
       489     if authkey is not None and not isinstance(authkey, bytes):

   /usr/lib/python3.5/multiprocessing/connection.py in SocketClient(address)
       612     with socket.socket( getattr(socket, family) ) as s:
       613         s.setblocking(True)
   --> 614         s.connect(address)
       615         return Connection(s.detach())
       616

   ConnectionRefusedError: [Errno 111] Connection refused
   ConnectionRefusedError: [Errno 111] Connection refused
   ```
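
For what it's worth, the last two frames of the traceback are plain `multiprocessing.connection` code, so the same exception class can be reproduced without MXNet. Below is a minimal sketch, assuming a TCP listener on 127.0.0.1 as a stand-in for the socket that `resource_sharer` hands to the consumer (the helper name `connect_after_listener_closed` is made up for illustration). Once the listening side is gone, `Client(address)` fails inside the socket `connect` call, the same place the traceback ends.

```python
import errno
from multiprocessing.connection import Client, Listener


def connect_after_listener_closed():
    """Return the errno raised when connecting to a listener that has gone away.

    Hypothetical helper for illustration only; it is not part of MXNet.
    """
    # Bind to an ephemeral local port, then remember the address the OS chose.
    listener = Listener(("127.0.0.1", 0))
    address = listener.address

    # Simulate the serving process disappearing before the client connects.
    listener.close()

    try:
        conn = Client(address)
    except ConnectionRefusedError as exc:
        return exc.errno
    else:
        # Not expected: something else is listening on that port.
        conn.close()
        return None


if __name__ == "__main__":
    code = connect_after_listener_closed()
    print("errno:", code, "is ECONNREFUSED:", code == errno.ECONNREFUSED)
```

This only illustrates the failure mode (nothing is accepting connections at the address the client was given). It does not show why the listening side went away in the nightly run.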
   
   Re: Docker
   One possible issue: make sure you're using the right platform and runtime. 
The first line of the following log shows the invocation: 
http://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/NightlyTestsForBinaries/branches/master/runs/221/nodes/76/steps/248/log/?start=0
   
   `+ ci/build.py --docker-registry mxnetci --nvidiadocker --platform 
ubuntu_nightly_gpu --docker-build-retries 3 --shm-size 500m 
/work/runtime_functions.sh nightly_tutorial_test_ubuntu_python2_gpu`
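
To compare that invocation against a local repro, it can help to pull the flags out of the logged command line. This is a rough sketch only: `parse_invocation` and `VALUE_FLAGS` are hypothetical names, and the set of value-taking flags is read off the log line above rather than taken from `ci/build.py`'s real argument parser.

```python
import shlex

# Command line copied from the Jenkins step log quoted above (without the "+ " prefix).
LOGGED = (
    "ci/build.py --docker-registry mxnetci --nvidiadocker --platform "
    "ubuntu_nightly_gpu --docker-build-retries 3 --shm-size 500m "
    "/work/runtime_functions.sh nightly_tutorial_test_ubuntu_python2_gpu"
)

# Assumption: these flags take one value each, judging only by the log line.
VALUE_FLAGS = {
    "--docker-registry",
    "--platform",
    "--docker-build-retries",
    "--shm-size",
}


def parse_invocation(command):
    """Split a logged command into ({flag: value}, [positional arguments])."""
    tokens = shlex.split(command)
    options, positional = {}, []
    i = 1  # skip the program name
    while i < len(tokens):
        tok = tokens[i]
        if tok in VALUE_FLAGS and i + 1 < len(tokens):
            options[tok] = tokens[i + 1]
            i += 2
        elif tok.startswith("--"):
            options[tok] = True  # boolean switch such as --nvidiadocker
            i += 1
        else:
            positional.append(tok)
            i += 1
    return options, positional


if __name__ == "__main__":
    opts, args = parse_invocation(LOGGED)
    print("platform:", opts.get("--platform"))
    print("shm size:", opts.get("--shm-size"))
    print("runtime function:", args[-1])
```

Printing the platform, shared-memory size, and runtime function side by side makes it easier to spot a mismatch with whatever is being run locally.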
   
   Additionally, try a different region and/or run Docker with higher 
verbosity.
   Vishaal


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
