Wallart edited a comment on issue #13037: ImageDetIter looping forever in MXNet-1.3.0
URL: https://github.com/apache/incubator-mxnet/issues/13037#issuecomment-439200178

I'm running my code in nvidia-docker containers (Ubuntu 17.10) with CUDA 9.2. I compiled each of my MXNet versions from source with OpenCV and MKL-DNN support. On the hardware side I'm using two GTX 1080 Ti cards.

Consider the following code snippet, using a dataset in ImageRecord format (approximately 200 images):

```
import time

import mxnet as mx
from mxnet import autograd, nd

# FocalLoss, SmoothL1Loss, net, training_targets, train_data, trainer,
# ctx, start_epoch, epochs, batch_size and log_interval are defined earlier.
cls_loss = FocalLoss()
box_loss = SmoothL1Loss()
cls_metric = mx.metric.Accuracy()
box_metric = mx.metric.MAE()

for epoch in range(start_epoch, epochs):
    # reset iterator and tick
    train_data.reset()
    cls_metric.reset()
    box_metric.reset()
    epoch_tick = time.time()
    # iterate through all batches
    for i, batch in enumerate(train_data):
        batch_tick = time.time()
        # record gradients
        with autograd.record():
            x = batch.data[0].as_in_context(ctx)
            y = batch.label[0].as_in_context(ctx)
            default_anchors, class_predictions, box_predictions = net(x)
            box_target, box_mask, cls_target = training_targets(default_anchors, class_predictions, y)
            # losses
            loss1 = cls_loss(class_predictions, cls_target)
            loss2 = box_loss(box_predictions, box_target, box_mask)
            # sum all losses
            loss = loss1 + loss2
            # backpropagate
            loss.backward()
        # apply
        trainer.step(batch_size)
        # update metrics
        cls_metric.update([cls_target], [nd.transpose(class_predictions, (0, 2, 1))])
        box_metric.update([box_target], [box_predictions * box_mask])
        if (i + 1) % log_interval == 0:
            name1, val1 = cls_metric.get()
            name2, val2 = box_metric.get()
            print('[Epoch %d Batch %d] speed: %f samples/s, training: %s=%f, %s=%f'
                  % (epoch, i, batch_size / (time.time() - batch_tick), name1, val1, name2, val2))

    # end of epoch logging
    name1, val1 = cls_metric.get()
    name2, val2 = box_metric.get()
    print('[Epoch %d] training: %s=%f, %s=%f' % (epoch, name1, val1, name2, val2))
    print('[Epoch %d] time cost: %f' % (epoch, time.time() - epoch_tick))
```

On MXNet 1.2.1 it works as expected and the epochs keep flowing through the console:

> [Epoch 0] training: accuracy=0.833192, mae=0.004929
> [Epoch 0] time cost: 1.240091
> [Epoch 1] training: accuracy=0.966545, mae=0.004379
> [Epoch 1] time cost: 0.610014
> [Epoch 2] training: accuracy=0.976884, mae=0.003983
> [Epoch 2] time cost: 0.631764
> [Epoch 3] training: accuracy=0.983173, mae=0.004638

But on MXNet 1.3.0 an epoch is divided into a seemingly infinite number of batches; the iterator never signals the end of the epoch:

> [Epoch 0 Batch 19] speed: 1155.356185 samples/s, training: accuracy=0.923830, mae=0.004783
> [Epoch 0 Batch 39] speed: 1105.710115 samples/s, training: accuracy=0.954663, mae=0.004561
> [Epoch 0 Batch 59] speed: 1169.286568 samples/s, training: accuracy=0.966536, mae=0.004413
> [Epoch 0 Batch 79] speed: 1132.142250 samples/s, training: accuracy=0.973061, mae=0.004393
> [Epoch 0 Batch 99] speed: 1115.432219 samples/s, training: accuracy=0.977253, mae=0.004304
> [Epoch 0 Batch 119] speed: 1139.079420 samples/s, training: accuracy=0.980220, mae=0.004205
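As a temporary workaround I can cap each epoch at the expected number of batches so the training loop cannot run forever while this is investigated. A minimal sketch of that guard is below; the rec/idx paths, batch size, data shape and epoch count are placeholders rather than my real configuration, and only `mx.image.ImageDetIter` itself is the API in question:

```
import mxnet as mx

batch_size = 32  # placeholder value
epochs = 10      # placeholder value

# Placeholder iterator setup: 'train.rec' / 'train.idx' stand in for the real dataset.
train_data = mx.image.ImageDetIter(
    batch_size=batch_size,
    data_shape=(3, 256, 256),
    path_imgrec='train.rec',
    path_imgidx='train.idx',
    shuffle=True)

# The dataset has roughly 200 images, so cap the inner loop at
# ceil(num_samples / batch_size) batches in case the iterator never stops.
num_samples = 200
max_batches = -(-num_samples // batch_size)

for epoch in range(epochs):
    train_data.reset()
    for i, batch in enumerate(train_data):
        if i >= max_batches:
            break  # guard against the iterator looping past the end of the epoch
        # ... training step as in the snippet above ...
```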