matteosal opened a new issue, #21111: URL: https://github.com/apache/incubator-mxnet/issues/21111
This script creates a batchnorm and runs it 3 times:

1) A first test-mode evaluation
2) A dummy training-mode evaluation
3) A second test-mode evaluation

The outputs of (1) and (3) are compared under various circumstances: CPU vs GPU, cudnn batchnorm ON vs OFF, evaluation (2) with vs without a backward pass.

```
import mxnet as mx
import numpy as np
from mxnet import autograd


def testStateChange(backward, device, cudnn):
    print()
    print('backward: ' + str(backward) + ', device: ' + str(device)
          + ', cudnn: ' + str(cudnn))

    sym = mx.symbol.BatchNorm(
        *[mx.symbol.Variable(name) for name in shapes.keys()],
        eps=0.001,
        fix_gamma=False,
        use_global_stats=False,
        axis=1,
        cudnn_off=not cudnn
    )
    op = mx.ndarray.CachedOp(sym)

    if device == mx.cpu():
        arguments = args_cpu
    else:
        arguments = args_gpu

    # First evaluation in test mode
    out1 = op(*arguments, default_ctx=device)

    # Dummy evaluation in training mode, with or without backward
    if backward:
        with autograd.record(train_mode=True):
            [arg.attach_grad() for arg in arguments]
            dummy = op(*arguments, default_ctx=device)
        autograd.backward(dummy, head_grads=mx.np.ones([1, 2, 3], ctx=device))
    else:
        with autograd.train_mode():
            op(*arguments, default_ctx=device)

    # Second evaluation in test mode
    out2 = op(*arguments, default_ctx=device)

    if np.isnan(np.sum(out1.asnumpy())):
        print('out1 has nans!')
    if np.isnan(np.sum(out2.asnumpy())):
        print('out2 has nans!')

    # Check whether the dummy evaluation in training mode has changed the
    # state of the batchnorm. If out1 and out2 differ, the state was changed.
    print(mx.np.max(mx.np.abs(out1 - out2)))


print("**** cudnn batchnorm inconsistency")
shapes = {'input': [1, 2, 3], 'gamma': [2], 'beta': [2], 'mean': [2], 'var': [2]}
args_cpu = [mx.np.random.uniform(size=shape, ctx=mx.cpu()) for shape in shapes.values()]
args_gpu = [mx.np.array(array, ctx=mx.gpu()) for array in args_cpu]

testStateChange(False, mx.cpu(), False)
testStateChange(True, mx.cpu(), False)
testStateChange(False, mx.gpu(), False)
testStateChange(True, mx.gpu(), False)
testStateChange(False, mx.gpu(), True)
testStateChange(True, mx.gpu(), True)

print("\n\n**** cudnn batchnorm nan")
shapes = {'input': [1, 6], 'gamma': [6], 'beta': [6], 'mean': [6], 'var': [6]}
args_cpu = [mx.np.random.uniform(size=shape, ctx=mx.cpu()) for shape in shapes.values()]
args_gpu = [mx.np.array(array, ctx=mx.gpu()) for array in args_cpu]

testStateChange(False, mx.gpu(), True)
```

I get this output from the above script:

```
**** cudnn batchnorm inconsistency

backward: False, device: cpu(0), cudnn: False
0.0

backward: True, device: cpu(0), cudnn: False
0.045242727

backward: False, device: gpu(0), cudnn: False
0.0

backward: True, device: gpu(0), cudnn: False
0.045242667

backward: False, device: gpu(0), cudnn: True
0.044606388

backward: True, device: gpu(0), cudnn: True
0.043622255


**** cudnn batchnorm nan

backward: False, device: gpu(0), cudnn: True
out2 has nans!
nan
```

This shows 2 problems:

1) The dummy training-mode evaluation can change the values of the moving mean and variance, thus making out1 and out2 differ in some runs, but it is inconsistent in doing so.
The "cudnn batchnorm inconsistency" output shows that the moving arrays are normally changed only when a BACKWARD pass in training mode is performed, but on GPU + cudnn they are changed by the FORWARD pass alone (case `backward: False, device: gpu(0), cudnn: True`).
2) The "cudnn batchnorm nan" output shows that the cudnn batchnorm can also output nan when alternating training-mode and test-mode evaluations with certain input shapes.