JonTanS opened a new issue #17960: Bert Training fails on MXNet 2.0
URL: https://github.com/apache/incubator-mxnet/issues/17960
 
 
## Description
Running BERT pretraining on MXNet 2.0 (master branch) fails with an `AttributeError` in the optimizer update. When I pip install MXNet 1.6, the same training script runs fine.
   
### Error Message
```
[23:02:13] src/storage/storage.cc:110: Using GPUPooledRoundedStorageManager.
INFO:root:Using training data at /home/ubuntu/.mxnet/datasets/bert_input/part-000.npz
INFO:root:1 files found.
[23:02:59] src/operator/contrib/../tensor/./../../common/utils.h:472: MXNET_SAFE_ACCUMULATION=1 is recommended for LayerNorm with float16 inputs. See https://mxnet.apache.org/api/faq/env_var for more details.
[23:03:00] src/operator/nn/./../../common/utils.h:472: MXNET_SAFE_ACCUMULATION=1 is recommended for softmax with float16 inputs. See https://mxnet.apache.org/api/faq/env_var for more details.
[23:03:00] src/operator/nn/./../../common/utils.h:472: MXNET_SAFE_ACCUMULATION=1 is recommended for softmax with float16 inputs. See https://mxnet.apache.org/api/faq/env_var for more details.
[23:03:00] src/operator/contrib/../tensor/./../../common/utils.h:472: MXNET_SAFE_ACCUMULATION=1 is recommended for LayerNorm with float16 inputs. See https://mxnet.apache.org/api/faq/env_var for more details.
Traceback (most recent call last):
  File "/home/ubuntu/MXNet-Benchmarks/mxnet_scripts/training_scripts/bert/run_pretraining.py", line 237, in <module>
    train(data_train, model, nsp_loss, mlm_loss, len(vocab), ctx, store)
  File "/home/ubuntu/MXNet-Benchmarks/mxnet_scripts/training_scripts/bert/run_pretraining.py", line 192, in train
    fp16_trainer.step(1, max_norm=1)
  File "/home/ubuntu/MXNet-Benchmarks/mxnet_scripts/training_scripts/bert/fp16_utils.py", line 166, in step
    self.fp32_trainer.update(step_size)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/gluon/trainer.py", line 436, in update
    self._update(ignore_stale_grad)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/gluon/trainer.py", line 469, in _update
    updater(i, g, w)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/optimizer/updater.py", line 91, in __call__
    self.optimizer.update_multi_precision([i], [w], [g], [self.states[i]])
  File "/home/ubuntu/.local/lib/python3.6/site-packages/gluonnlp/optimizer/bert_adam.py", line 91, in update_multi_precision
    use_multi_precision = self.multi_precision and weight.dtype == numpy.float16
AttributeError: 'list' object has no attribute 'dtype'
```
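
The crash looks like a calling-convention mismatch: MXNet 2.0's `mxnet/optimizer/updater.py` passes one-element lists (`[i]`, `[w]`, `[g]`, `[self.states[i]]`) to `update_multi_precision`, while GluonNLP's `bert_adam.py` still expects bare NDArrays, so `weight.dtype` ends up being evaluated on a list. Below is a minimal local shim, sketched under the assumption that BERTAdam dispatches to a `_update_impl` helper as in the GluonNLP source; it is a workaround sketch, not the official fix:

```python
import numpy

def update_multi_precision(self, indices, weights, grads, states):
    """Sketch of a patched BERTAdam.update_multi_precision that accepts both
    the MXNet 2.0 list-based calling convention and the 1.x bare-NDArray one.
    Assumes the _update_impl helper from gluonnlp/optimizer/bert_adam.py."""
    if isinstance(weights, (list, tuple)):
        # MXNet 2.0 convention: one-element lists, as seen in the traceback.
        for index, weight, grad, state in zip(indices, weights, grads, states):
            use_mp = self.multi_precision and weight.dtype == numpy.float16
            self._update_impl(index, weight, grad, state, multi_precision=use_mp)
    else:
        # MXNet 1.x convention: bare NDArrays, which the released BERTAdam expects.
        use_mp = self.multi_precision and weights.dtype == numpy.float16
        self._update_impl(indices, weights, grads, states, multi_precision=use_mp)
```

Either way, the released BERTAdam and the MXNet 2.0 optimizer API appear to be out of sync, so a GluonNLP build that targets the 2.0 optimizer interface is probably what is actually needed.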
   
## To Reproduce
The scripts come from [Gluon Model Zoo BERT](https://gluon-nlp.mxnet.io/model_zoo/bert/index.html).
   
### Steps to reproduce
Downloaded the scripts and ran:
```
python /home/ubuntu/MXNet-Benchmarks/mxnet_scripts/training_scripts/bert/run_pretraining.py --data ~/.mxnet/datasets/bert_input/part-000.npz --data_eval ~/.mxnet/datasets/bert_input/part-000.npz --accumulate 4 --lr 1e-4 --num_steps 100000 --gpus 0
```
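
Separately from the crash, the warnings in the log recommend `MXNET_SAFE_ACCUMULATION=1` for float16 LayerNorm/softmax. If you want to follow that recommendation while reproducing, one way (a sketch; the variable itself is documented in the env_var FAQ linked in the log) is to set it before MXNet is imported:

```python
import os

# Set before importing mxnet so the backend sees it; recommended by the
# float16 LayerNorm/softmax warnings in the log above.
os.environ['MXNET_SAFE_ACCUMULATION'] = '1'

import mxnet as mx
```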
   
## Environment
AWS Deep Learning AMI (DL AMI) 27
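
Worth capturing here as well: the traceback resolves `mxnet` from `anaconda3/site-packages` but `gluonnlp` from `~/.local/site-packages`, so it may help to record exactly which installs the script picks up. A quick check using only standard module attributes:

```python
import mxnet
import gluonnlp

# Report the versions and install paths actually imported at runtime; the
# traceback shows mxnet under anaconda3 but gluonnlp under ~/.local.
print('mxnet   :', mxnet.__version__, '->', mxnet.__file__)
print('gluonnlp:', gluonnlp.__version__, '->', gluonnlp.__file__)
```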
   
