gilbertfrancois commented on issue #18751: URL: https://github.com/apache/incubator-mxnet/issues/18751#issuecomment-662179530
Many thanks for all the help and the swift responses.

TL;DR: adding BatchNorm at the end of a feature extractor, computing on GPU, and using a batch size of 1 results in NaN output.

---

## Description

I've created a small test script that builds a network with a Conv2D feature extractor, followed by a **tail** (a HybridSequential) that produces a custom-sized embeddings output. For training, an additional Dense layer with N_CLASSES output units is appended.

The first variant has BatchNorm layers in the tail:

```
MyNet(
  (features): HybridSequential(
    ...some layers...
  )
  (tail): HybridSequential(
    (0): Flatten
    (1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=2048)
    (2): Dense(2048 -> 128, linear)
    (3): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=32)
  )
  (output): Dense(128 -> 10, linear)
)
```

The second variant has no BatchNorm layers in the tail:

```
MyNet2(
  (features): HybridSequential(
    ...some layers...
  )
  (tail): HybridSequential(
    (0): Flatten
    (1): Dense(2048 -> 128, linear)
  )
  (output): Dense(128 -> 10, linear)
)
```

The script tests 8 cases:

- MyNet with and without BN layers in the tail
- on CPU and on GPU
- a forward pass with a batch of shape _(1, 3, 224, 224)_ and of shape _(2, 3, 224, 224)_, computing `y_out` and `y_embeddings`.

The output of the test is as follows:

```
      ctx  with_batchnorm  y_out                                               y_embeddings                                        input_shape
0  cpu(0)            True  [[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,...  [[-0.15872642, -0.21955031, -0.9183478, -0.140...  (1, 3, 224, 224)
1  gpu(0)            True  [[-2.9428682e-12, 1.359429e-13, -3.246836e-12,...  [[nan, nan, nan, nan, nan, nan, nan, nan, nan,...  (1, 3, 224, 224)
2  cpu(0)           False  [[-0.027459817, -0.032691482, -0.060781483, 0....  [[0.045531094, 0.22243185, -0.8400582, 0.12976...  (1, 3, 224, 224)
3  gpu(0)           False  [[-0.027459746, -0.03269219, -0.06078052, 0.01...  [[0.08575564, -0.0057277326, -0.027809529, 0.0...  (1, 3, 224, 224)
4  cpu(0)            True  [[0.020913437, -0.03557911, -0.03905518, -0.21...  [[0.63409245, -0.15498748, 0.5944583, 0.206986...  (2, 3, 224, 224)
5  gpu(0)            True  [[0.02108069, -0.035640452, -0.038803764, -0.2...  [[0.09521756, -0.050475493, 0.0824465, -0.0377...  (2, 3, 224, 224)
6  cpu(0)           False  [[-0.13125925, 0.023433281, 0.0269559, -0.0618...  [[-0.90178394, 0.27470633, -0.19833195, 0.5378...  (2, 3, 224, 224)
7  gpu(0)           False  [[-0.1312578, 0.023436725, 0.026953252, -0.061...  [[-0.09999961, 0.0913848, -0.11836913, 0.03843...  (2, 3, 224, 224)
```

A few observations:

- MyNet with BatchNorm layers on GPU gives NaN for `y_embeddings` when _n-samples = 1_.
- MyNet with BatchNorm layers on GPU gives a matrix with real numbers for `y_embeddings` when _n-samples > 1_.
- The results for `y_out` on CPU and GPU are all close for matching test cases.
- The results for `y_embeddings` on CPU and GPU are never close.
- Removing layer (3) from MyNet does not avoid the NaN in `y_embeddings`.
- I don't understand why `y_out` from MyNet with BatchNorm on GPU still contains real numbers, given that the preceding layer outputs NaN.

It may very well be that @wkcn has a point: if m - 1 is used in the denominator when computing the sample variance, that would explain why we see NaN for a batch with a single sample and real numbers for larger batches, as illustrated by the sketch below.
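If that is what happens, the NaN can be reproduced with plain arithmetic: the unbiased estimator divides by m - 1, which is zero for a single sample. A minimal sketch with NumPy (just the arithmetic, not MXNet's actual BatchNorm kernel):

```python
import numpy as np

x = np.random.randn(1, 128)  # a "batch" containing a single sample

# Biased estimator (divide by m): well defined, every feature has variance 0.0.
print(np.var(x, axis=0, ddof=0)[:3])   # -> [0. 0. 0.]

# Unbiased estimator (divide by m - 1 = 0): division by zero, result is NaN.
print(np.var(x, axis=0, ddof=1)[:3])   # -> [nan nan nan]
```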
## To reproduce

Install `mxnet-cu102` (or the build that matches your CUDA version), `gluoncv` and `pandas`, then run:

```
curl --retry 10 -s https://gist.githubusercontent.com/gilbertfrancois/888f81042f5edaa42b1011d28264cff4/raw/d6e2609e4132d21a8bbd318265e007f94418b84e/bn_test_2.py | python
```
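For reference, a minimal, hypothetical sketch of the MyNet variant with the BatchNorm tail (the real feature extractor and hyperparameters are in the gist above; the Conv2D/pooling layers below are placeholders):

```python
import mxnet as mx
from mxnet.gluon import nn

class MyNet(nn.HybridBlock):
    def __init__(self, n_embeddings=128, n_classes=10, **kwargs):
        super(MyNet, self).__init__(**kwargs)
        with self.name_scope():
            # Placeholder feature extractor; the gist uses a real backbone.
            self.features = nn.HybridSequential()
            self.features.add(nn.Conv2D(2048, kernel_size=3),
                              nn.GlobalAvgPool2D())
            # Tail producing the embeddings, with BatchNorm layers.
            self.tail = nn.HybridSequential()
            self.tail.add(nn.Flatten(),
                          nn.BatchNorm(),
                          nn.Dense(n_embeddings),
                          nn.BatchNorm())
            # Classification head used during training.
            self.output = nn.Dense(n_classes)

    def hybrid_forward(self, F, x):
        x = self.features(x)
        embeddings = self.tail(x)
        out = self.output(embeddings)
        return out, embeddings

# Example of the batch-size-1 case on GPU (assumes a CUDA device is available):
# net = MyNet()
# net.initialize(ctx=mx.gpu(0))
# y_out, y_embeddings = net(mx.nd.random.uniform(shape=(1, 3, 224, 224), ctx=mx.gpu(0)))
```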