gilbertfrancois commented on issue #18751: URL: https://github.com/apache/incubator-mxnet/issues/18751#issuecomment-662179530
Many thanks for all the help and the swift responses.

TL;DR: adding BatchNorm at the end of a feature extractor, computing on GPU, and using a batch size of 1 results in NaN output.

---

## Description

I've created a small test script that builds a network with a Conv2D feature extractor, followed by a **tail** (a HybridSequential) that produces a custom-sized embeddings output. For training, an additional Dense layer with N_CLASSES output units is appended.

The first variant has BatchNorm layers in the tail:

```
MyNet(
  (features): HybridSequential(
    ...some layers...
  )
  (tail): HybridSequential(
    (0): Flatten
    (1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=2048)
    (2): Dense(2048 -> 128, linear)
    (3): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=32)
  )
  (output): Dense(128 -> 10, linear)
)
```

The second variant has no BatchNorm layers in the tail:

```
MyNet2(
  (features): HybridSequential(
    ...some layers...
  )
  (tail): HybridSequential(
    (0): Flatten
    (1): Dense(2048 -> 128, linear)
  )
  (output): Dense(128 -> 10, linear)
)
```

The script tests 8 cases:

- MyNet with and without BN layers in the tail
- on CPU and on GPU
- a forward pass with a batch of shape _(1, 3, 224, 224)_ and of shape _(2, 3, 224, 224)_, computing `y_out` and `y_embeddings`.

The output of the test is as follows:

```
      ctx  with_batchnorm  y_out                                               y_embeddings                                        input_shape
0  cpu(0)            True  [[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,...  [[-0.15872642, -0.21955031, -0.9183478, -0.140...  (1, 3, 224, 224)
1  gpu(0)            True  [[-2.9428682e-12, 1.359429e-13, -3.246836e-12,...  [[nan, nan, nan, nan, nan, nan, nan, nan, nan,...  (1, 3, 224, 224)
2  cpu(0)           False  [[-0.027459817, -0.032691482, -0.060781483, 0....  [[0.045531094, 0.22243185, -0.8400582, 0.12976...  (1, 3, 224, 224)
3  gpu(0)           False  [[-0.027459746, -0.03269219, -0.06078052, 0.01...  [[0.08575564, -0.0057277326, -0.027809529, 0.0...  (1, 3, 224, 224)
4  cpu(0)            True  [[0.020913437, -0.03557911, -0.03905518, -0.21...  [[0.63409245, -0.15498748, 0.5944583, 0.206986...  (2, 3, 224, 224)
5  gpu(0)            True  [[0.02108069, -0.035640452, -0.038803764, -0.2...  [[0.09521756, -0.050475493, 0.0824465, -0.0377...  (2, 3, 224, 224)
6  cpu(0)           False  [[-0.13125925, 0.023433281, 0.0269559, -0.0618...  [[-0.90178394, 0.27470633, -0.19833195, 0.5378...  (2, 3, 224, 224)
7  gpu(0)           False  [[-0.1312578, 0.023436725, 0.026953252, -0.061...  [[-0.09999961, 0.0913848, -0.11836913, 0.03843...  (2, 3, 224, 224)
```

A few observations:

- MyNet with BatchNorm layers on GPU gives NaN for `y_embeddings` when _n-samples = 1_.
- MyNet with BatchNorm layers on GPU gives a matrix with real numbers for `y_embeddings` when _n-samples > 1_.
- The results for `y_out` on CPU and GPU are all close for matching test cases.
- The results for `y_embeddings` on CPU and GPU are never close.
- Removing layer (3) from MyNet does not avoid the NaN in `y_embeddings`.
- I don't understand why `y_out` from MyNet with BatchNorm on GPU still contains real numbers, given that the preceding layer outputs NaN.

It may very well be that @wkcn has a point: if m - 1 is used in the denominator when computing the sample variance, that would explain why we see NaN for a batch with a single sample and real numbers for larger batches, as illustrated by the sketch below.
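If that is what happens, the NaN can be reproduced with plain arithmetic: the unbiased estimator divides by m - 1, which is zero for a single sample. A minimal sketch with NumPy (just the arithmetic, not MXNet's actual BatchNorm kernel):

```python
import numpy as np

x = np.random.randn(1, 128)  # a "batch" containing a single sample

# Biased estimator (divide by m): well defined, every feature has variance 0.0.
print(np.var(x, axis=0, ddof=0)[:3])   # -> [0. 0. 0.]

# Unbiased estimator (divide by m - 1 = 0): division by zero, result is NaN.
print(np.var(x, axis=0, ddof=1)[:3])   # -> [nan nan nan]
```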
## To reproduce

Install `mxnet-cu102` (or the build that matches your CUDA version), `gluoncv` and `pandas`, then run:

```
curl --retry 10 -s https://gist.githubusercontent.com/gilbertfrancois/888f81042f5edaa42b1011d28264cff4/raw/d6e2609e4132d21a8bbd318265e007f94418b84e/bn_test_2.py | python
```
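For reference, a minimal, hypothetical sketch of the MyNet variant with the BatchNorm tail (the real feature extractor and hyperparameters are in the gist above; the Conv2D/pooling layers below are placeholders):

```python
import mxnet as mx
from mxnet.gluon import nn

class MyNet(nn.HybridBlock):
    def __init__(self, n_embeddings=128, n_classes=10, **kwargs):
        super(MyNet, self).__init__(**kwargs)
        with self.name_scope():
            # Placeholder feature extractor; the gist uses a real backbone.
            self.features = nn.HybridSequential()
            self.features.add(nn.Conv2D(2048, kernel_size=3),
                              nn.GlobalAvgPool2D())
            # Tail producing the embeddings, with BatchNorm layers.
            self.tail = nn.HybridSequential()
            self.tail.add(nn.Flatten(),
                          nn.BatchNorm(),
                          nn.Dense(n_embeddings),
                          nn.BatchNorm())
            # Classification head used during training.
            self.output = nn.Dense(n_classes)

    def hybrid_forward(self, F, x):
        x = self.features(x)
        embeddings = self.tail(x)
        out = self.output(embeddings)
        return out, embeddings

# Example of the batch-size-1 case on GPU (assumes a CUDA device is available):
# net = MyNet()
# net.initialize(ctx=mx.gpu(0))
# y_out, y_embeddings = net(mx.nd.random.uniform(shape=(1, 3, 224, 224), ctx=mx.gpu(0)))
```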