indhub commented on issue #4253: training-accuracy nan
URL: 
https://github.com/apache/incubator-mxnet/issues/4253#issuecomment-379191066
 
 
   The "Epoch[x] Train-accuracy=xxx" line printed at the end of every epoch 
gives the impression that the accuracy metric is for the entire epoch. In 
reality, it is NOT.
   
   For example, if an epoch consists of 101 batches and the callback was 
created to print metrics every 10 batches (resetting the metric each time it 
prints), what is printed as 'Train-accuracy=xxx' at the end of the epoch is 
actually just the accuracy of a single batch (the 101st). Printing this as 
'Epoch[x] Train-accuracy=xxx' is very misleading.
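
   The effect can be reproduced with a toy sketch that does not use MXNet at 
all. `RunningAccuracy` and `run_epoch` below are hypothetical stand-ins for 
the metric and the training loop, and the 1-based batch counter is an 
assumption made for readability, so treat this as an illustration of the 
mechanism rather than the library's actual code.

```python
class RunningAccuracy:
    """Toy accuracy metric (hypothetical, not the MXNet class)."""

    def __init__(self):
        self.reset()

    def reset(self):
        self.num_correct = 0
        self.num_inst = 0

    def update(self, correct, total):
        self.num_correct += correct
        self.num_inst += total

    def get(self):
        # An empty metric has nothing to average, so report nan.
        if self.num_inst == 0:
            return float("nan")
        return self.num_correct / self.num_inst


def run_epoch(num_batches, frequent, batch_correct, batch_size=10):
    """Return the value that would be printed at the end of the epoch."""
    metric = RunningAccuracy()
    for nbatch in range(1, num_batches + 1):
        metric.update(batch_correct(nbatch), batch_size)
        if nbatch % frequent == 0:
            metric.get()    # the periodic log line would be printed here
            metric.reset()  # the callback clears the running totals
    return metric.get()


# Batches 1..100 get 5 of 10 samples right; batch 101 gets all 10 right.
epoch_end = run_epoch(101, 10, lambda n: 10 if n == 101 else 5)
true_epoch_accuracy = (100 * 5 + 10) / (101 * 10)
print(epoch_end, true_epoch_accuracy)
```

   In this sketch the epoch-end value reflects only batch 101, while the 
accuracy over all 101 batches is much lower.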
   
   Ideally, we should remove 
[this](https://github.com/apache/incubator-mxnet/blob/62bf9ec16886434d5e4eac279db19cb18c5c9c45/python/mxnet/module/base_module.py#L535)
 misleading print statement. However, I'm sure there are a lot of existing 
scripts out there that look for this statement, and we can't remove it without 
breaking them. Since this would be a breaking change, let's make it in a 
major version release.
   
   For now, to avoid the nan error, we can skip resetting the metric when 
processing the callback for the last batch. It will be reset at the beginning 
of the next epoch anyway.
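
   A minimal sketch of that workaround, again without MXNet: 
`epoch_end_accuracy` and its `skip_reset_on_last_batch` flag are hypothetical 
names, and the sketch assumes the nan comes from reading a metric that was 
reset on the final batch and therefore holds zero samples.

```python
import math


def epoch_end_accuracy(num_batches, frequent, skip_reset_on_last_batch):
    """Toy training loop; returns the accuracy reported at epoch end."""
    num_correct = num_inst = 0
    for nbatch in range(1, num_batches + 1):
        num_correct += 8  # assume 8 of 10 samples are correct in every batch
        num_inst += 10
        is_last = nbatch == num_batches
        if nbatch % frequent == 0 and not (skip_reset_on_last_batch and is_last):
            num_correct = num_inst = 0  # periodic callback resets the metric
    # With zero accumulated samples there is nothing to divide by.
    return num_correct / num_inst if num_inst else float("nan")


# 100 batches, logging every 10: the last batch coincides with a reset.
without_fix = epoch_end_accuracy(100, 10, skip_reset_on_last_batch=False)
with_fix = epoch_end_accuracy(100, 10, skip_reset_on_last_batch=True)
print(math.isnan(without_fix), with_fix)
```

   Note that even with the reset skipped, the epoch-end value in this sketch 
still covers only the batches since the previous reset, not the whole epoch.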

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services
