vishaalkapoor commented on issue #14026: [Nightly Test Failure] Tutorial test_tutorials.test_gluon_end_to_end Test Failure
URL: https://github.com/apache/incubator-mxnet/issues/14026#issuecomment-461973064

There's a connection issue in the logs. It may be related to running inside a Docker image and being sandboxed in some way.

As per http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/NightlyTestsForBinaries/detail/master/221/pipeline

```
ERROR:root:An error occurred while executing the following cell:
------------------
def test(net, val_data, ctx):
    metric = mx.metric.Accuracy()
    for i, (data, label) in enumerate(val_data):
        data = gluon.utils.split_and_load(data, ctx_list=ctx, even_split=False)
        label = gluon.utils.split_and_load(label, ctx_list=ctx, even_split=False)
        outputs = [net(x) for x in data]
        metric.update(label, outputs)
    return metric.get()

trainer = gluon.Trainer(finetune_net.collect_params(), optimizer=sgd_optimizer)

# start with epoch 1 for easier learning rate calculation
for epoch in range(1, epochs + 1):

    tic = time.time()
    train_loss = 0
    metric.reset()

    for i, (data, label) in enumerate(train_data):
        # get the images and labels
        data = gluon.utils.split_and_load(data, ctx_list=ctx, even_split=False)
        label = gluon.utils.split_and_load(label, ctx_list=ctx, even_split=False)
        with autograd.record():
            outputs = [finetune_net(x) for x in data]
            loss = [softmax_cross_entropy(yhat, y) for yhat, y in zip(outputs, label)]
        for l in loss:
            l.backward()

        trainer.step(batch_size)
        train_loss += sum([l.mean().asscalar() for l in loss]) / len(loss)
        metric.update(label, outputs)

    _, train_acc = metric.get()
    train_loss /= num_batch
    _, val_acc = test(finetune_net, val_data, ctx)

    print('[Epoch %d] Train-acc: %.3f, loss: %.3f | Val-acc: %.3f | learning-rate: %.3E | time: %.1f' %
          (epoch, train_acc, train_loss, val_acc, trainer.learning_rate, time.time() - tic))

_, test_acc = test(finetune_net, test_data, ctx)
print('[Finished] Test-acc: %.3f' % (test_acc))
------------------
```

```
---------------------------------------------------------------------------
ConnectionRefusedError                    Traceback (most recent call last)
<ipython-input-6-cfd10a99e63e> in <module>
     17     metric.reset()
     18
---> 19     for i, (data, label) in enumerate(train_data):
     20         # get the images and labels
     21         data = gluon.utils.split_and_load(data, ctx_list=ctx, even_split=False)

/work/mxnet/python/mxnet/gluon/data/dataloader.py in __next__(self)
    441         assert self._rcvd_idx in self._data_buffer, "fatal error with _push_next, rcvd_idx missing"
    442         ret = self._data_buffer.pop(self._rcvd_idx)
--> 443         batch = pickle.loads(ret.get()) if self._dataset is None else ret.get()
    444         if self._pin_memory:
    445             batch = _as_in_context(batch, context.cpu_pinned())

/work/mxnet/python/mxnet/gluon/data/dataloader.py in rebuild_ndarray(pid, fd, shape, dtype)
     55             fd = multiprocessing.reduction.rebuild_handle(fd)
     56         else:
---> 57             fd = fd.detach()
     58         return nd.NDArray(nd.ndarray._new_from_shared_mem(pid, fd, shape, dtype))
     59

/usr/lib/python3.5/multiprocessing/resource_sharer.py in detach(self)
     55         def detach(self):
     56             '''Get the fd.  This should only be called once.'''
---> 57             with _resource_sharer.get_connection(self._id) as conn:
     58                 return reduction.recv_handle(conn)
     59

/usr/lib/python3.5/multiprocessing/resource_sharer.py in get_connection(ident)
     85         from .connection import Client
     86         address, key = ident
---> 87         c = Client(address, authkey=process.current_process().authkey)
     88         c.send((key, os.getpid()))
     89         return c

/usr/lib/python3.5/multiprocessing/connection.py in Client(address, family, authkey)
    485         c = PipeClient(address)
    486     else:
--> 487         c = SocketClient(address)
    488
    489     if authkey is not None and not isinstance(authkey, bytes):

/usr/lib/python3.5/multiprocessing/connection.py in SocketClient(address)
    612     with socket.socket( getattr(socket, family) ) as s:
    613         s.setblocking(True)
--> 614         s.connect(address)
    615         return Connection(s.detach())
    616

ConnectionRefusedError: [Errno 111] Connection refused
ConnectionRefusedError: [Errno 111] Connection refused
```

Re: Docker

One possible issue: make sure you're using the right platform and runtime. The first line of http://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/NightlyTestsForBinaries/branches/master/runs/221/nodes/76/steps/248/log/?start=0 is:

`+ ci/build.py --docker-registry mxnetci --nvidiadocker --platform ubuntu_nightly_gpu --docker-build-retries 3 --shm-size 500m /work/runtime_functions.sh nightly_tutorial_test_ubuntu_python2_gpu`

Additionally, try a different region and/or run Docker with higher verbosity.

Vishaal
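
When checking the container runtime, one thing that could be inspected is the shared-memory mount, since the traceback passes through `_new_from_shared_mem` and the build command sets `--shm-size 500m`. The snippet below is only a diagnostic sketch to run inside the container. It assumes shared memory is mounted at `/dev/shm` (the usual Linux location), and `shm_report` is a hypothetical helper name, not part of MXNet or the CI scripts. It does not establish that shared memory is the cause of the refused connection.

```python
import os
import shutil


def shm_report(path="/dev/shm"):
    """Return total/used/free sizes in MiB for `path`, or None if it is missing.

    `path` defaults to /dev/shm, which is an assumption about where the
    container mounts shared memory.
    """
    if not os.path.isdir(path):
        return None
    usage = shutil.disk_usage(path)
    mib = 1024 * 1024
    return {
        "total_mib": usage.total // mib,
        "used_mib": usage.used // mib,
        "free_mib": usage.free // mib,
    }


if __name__ == "__main__":
    report = shm_report()
    if report is None:
        print("/dev/shm not found; shared memory may be mounted elsewhere")
    else:
        print("shared memory: {total_mib} MiB total, {free_mib} MiB free".format(**report))
```

If the reported total does not match the value passed via `--shm-size`, that would suggest the flag is not reaching the container as expected.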
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

With regards,
Apache Git Services
