jeremiedb commented on issue #7968: [R] Transfer Learning using VGG-16
URL: 
https://github.com/apache/incubator-mxnet/issues/7968#issuecomment-355477039
 
 
   I just ran a few tests with different ResNet models and I also experienced 
crashes. 
   
   The issue appears to be tied to memory that isn't released during training. 
No problem with ResNet34 or 50, but it became problematic with 101. Have you 
looked at the GPU usage immediately after launching the training (nvidia-smi) 
to confirm you have the same issue? 
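
   A minimal sketch of checking GPU memory from the R session itself, assuming 
`nvidia-smi` is installed and on the PATH. The helper name `gpu_memory` is 
hypothetical; it just wraps the `nvidia-smi` query and returns `NULL` when the 
tool isn't available:

```r
# Hypothetical helper: report used/total GPU memory via nvidia-smi.
gpu_memory <- function() {
  # Sys.which() returns "" when the executable cannot be found.
  if (!nzchar(Sys.which("nvidia-smi"))) {
    return(NULL)
  }
  # Capture the CSV output (one header line, then one line per GPU).
  system2(
    "nvidia-smi",
    c("--query-gpu=memory.used,memory.total", "--format=csv"),
    stdout = TRUE
  )
}

print(gpu_memory())
```

   Calling it right after launching the training and again a few batches later 
should show whether used memory keeps growing.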
   
   I also noticed an apparent memory leak when running large embeddings. A 
quick workaround is to add a gc() within the training loop after every couple 
of batches (it isn't necessary to add a gc() within the eval data loop). You 
can do it in either `mx.model.FeedForward.create` or `mx.model.buckets` (I 
only used the latter, but it should work for the usual training function). The 
good news is that it doesn't noticeably slow down the training, and 
fine-tuning ResNet101 no longer crashed, with GPU memory remaining below 4 GB 
with 8 samples. 
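
   Roughly, the workaround looks like the sketch below. This is not the actual 
mxnet training code: `train_step`, `gc_every` and `num_batches` are 
hypothetical placeholders standing in for the body of the batch loop, and the 
interval of 2 batches is just an assumption.

```r
# Hypothetical stand-in for one training step (forward, backward, update).
# In the real training function this would be the body of the batch loop;
# here it only allocates a temporary matrix.
train_step <- function(batch_id) {
  tmp <- matrix(rnorm(1e4), nrow = 100)
  invisible(sum(tmp))
}

gc_every <- 2L      # assumed interval: collect after every 2 batches
num_batches <- 10L  # assumed number of batches in the epoch
gc_calls <- 0L      # counter, only here to make the behaviour visible

for (batch_id in seq_len(num_batches)) {
  train_step(batch_id)
  # Periodically run R's garbage collector so that objects no longer
  # referenced can be collected and their memory released.
  if (batch_id %% gc_every == 0L) {
    invisible(gc(verbose = FALSE))
    gc_calls <- gc_calls + 1L
  }
}
```

   The interval is a trade-off: calling gc() on every batch costs more time, 
while a larger interval lets more unreleased memory pile up between 
collections.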
   
   @thirdwing Any idea whether this memory issue could be handled better than 
with gc()? If performance isn't affected, I wonder whether a quick PR adding 
the gc() would be worthwhile. 
   
