karan6181 commented on issue #17485: URL: https://github.com/apache/incubator-mxnet/issues/17485#issuecomment-764860187
- I debugged the code a bit and found that this line (https://github.com/dmlc/gluon-cv/blob/master/scripts/instance/mask_rcnn/train_mask_rcnn.py#L694) might be the culprit, or at least the place where we should start looking.
- For `batch_size=2 per GPU`, I think the training data loader is created in such a way that it passes 2 images per GPU during the forward pass. Validation, however, doesn't support multi-batch: it always passes 1 image per GPU, regardless of whether the user sets `batch_size=2 per GPU`. Since the model expects 2 images per GPU and validation passes only 1, we see the above error. If I run training with `batch_size=2 per GPU`, save the model params, and then run validation by loading the same params with `batch_size=1 per GPU`, validation works. So the problem has something to do with how `per_device_batch_size` is managed in https://github.com/dmlc/gluon-cv/blob/master/gluoncv/model_zoo/rcnn/faster_rcnn/faster_rcnn.py or https://github.com/dmlc/gluon-cv/blob/master/gluoncv/model_zoo/rcnn/mask_rcnn/rcnn_target.py. I might lack some background knowledge on this, but this is what I found.
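
To make the suspected mismatch concrete, here is a minimal toy sketch in plain Python. It does not use GluonCV or MXNet: `FixedBatchBlock`, its `forward` method, and the weights are hypothetical stand-ins, and the only assumption carried over from the report is that the model bakes in a `per_device_batch_size` at construction time. It shows why a model built for 2 images per GPU would reject a 1-image validation batch, and why reloading the same params into a model built for 1 image per GPU would work.

```python
class FixedBatchBlock:
    """Hypothetical stand-in for a block that fixes per_device_batch_size at build time."""

    def __init__(self, per_device_batch_size):
        self.per_device_batch_size = per_device_batch_size
        # Toy parameters; their shape does not depend on the batch size.
        self.params = {"weight": [0.5, -1.0, 2.0]}

    def forward(self, images):
        # The block only accepts exactly the batch size it was built for.
        if len(images) != self.per_device_batch_size:
            raise ValueError(
                f"expected {self.per_device_batch_size} images per device, "
                f"got {len(images)}"
            )
        w = self.params["weight"]
        # One dot product per image, as a placeholder for the real computation.
        return [sum(wi * px for wi, px in zip(w, img)) for img in images]


# Training: the loader hands 2 images to each device, matching the block.
train_block = FixedBatchBlock(per_device_batch_size=2)
train_out = train_block.forward([[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]])

# Validation with the same block: the loader hands over only 1 image.
try:
    train_block.forward([[1.0, 2.0, 3.0]])
    validation_failed = False
except ValueError as err:
    validation_failed = True
    print("validation failed:", err)

# Workaround from the comment: rebuild for 1 image per device, reuse the params.
val_block = FixedBatchBlock(per_device_batch_size=1)
val_block.params = dict(train_block.params)
val_out = val_block.forward([[1.0, 2.0, 3.0]])
print("validation output:", val_out)
```

In the real scripts the equivalent of the last step would be saving the params after training and loading them into a network constructed with a per-device batch size of 1; whether that is the right long-term fix, rather than making validation honor the configured batch size, is still open.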
