JonTanS opened a new issue #17960: Bert Training fails on MXNet 2.0
URL: https://github.com/apache/incubator-mxnet/issues/17960

## Description

Running BERT training on MXNet 2.0 (master branch) fails. When I pip install MXNet 1.6, the same training script runs fine.

### Error Message

```
[23:02:13] src/storage/storage.cc:110: Using GPUPooledRoundedStorageManager.
INFO:root:Using training data at /home/ubuntu/.mxnet/datasets/bert_input/part-000.npz
INFO:root:1 files found.
[23:02:59] src/operator/contrib/../tensor/./../../common/utils.h:472: MXNET_SAFE_ACCUMULATION=1 is recommended for LayerNorm with float16 inputs. See https://mxnet.apache.org/api/faq/env_var for more details.
[23:03:00] src/operator/nn/./../../common/utils.h:472: MXNET_SAFE_ACCUMULATION=1 is recommended for softmax with float16 inputs. See https://mxnet.apache.org/api/faq/env_var for more details.
[23:03:00] src/operator/nn/./../../common/utils.h:472: MXNET_SAFE_ACCUMULATION=1 is recommended for softmax with float16 inputs. See https://mxnet.apache.org/api/faq/env_var for more details.
[23:03:00] src/operator/contrib/../tensor/./../../common/utils.h:472: MXNET_SAFE_ACCUMULATION=1 is recommended for LayerNorm with float16 inputs. See https://mxnet.apache.org/api/faq/env_var for more details.
Traceback (most recent call last):
  File "/home/ubuntu/MXNet-Benchmarks/mxnet_scripts/training_scripts/bert/run_pretraining.py", line 237, in <module>
    train(data_train, model, nsp_loss, mlm_loss, len(vocab), ctx, store)
  File "/home/ubuntu/MXNet-Benchmarks/mxnet_scripts/training_scripts/bert/run_pretraining.py", line 192, in train
    fp16_trainer.step(1, max_norm=1)
  File "/home/ubuntu/MXNet-Benchmarks/mxnet_scripts/training_scripts/bert/fp16_utils.py", line 166, in step
    self.fp32_trainer.update(step_size)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/gluon/trainer.py", line 436, in update
    self._update(ignore_stale_grad)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/gluon/trainer.py", line 469, in _update
    updater(i, g, w)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/optimizer/updater.py", line 91, in __call__
    self.optimizer.update_multi_precision([i], [w], [g], [self.states[i]])
  File "/home/ubuntu/.local/lib/python3.6/site-packages/gluonnlp/optimizer/bert_adam.py", line 91, in update_multi_precision
    use_multi_precision = self.multi_precision and weight.dtype == numpy.float16
AttributeError: 'list' object has no attribute 'dtype'
```

## To Reproduce

Grab the scripts from [Gluon Model Zoo BERT](https://gluon-nlp.mxnet.io/model_zoo/bert/index.html).

### Steps to reproduce

Download the scripts and run:

```
python /home/ubuntu/MXNet-Benchmarks/mxnet_scripts/training_scripts/bert/run_pretraining.py --data ~/.mxnet/datasets/bert_input/part-000.npz --data_eval ~/.mxnet/datasets/bert_input/part-000.npz --accumulate 4 --lr 1e-4 --num_steps 100000 --gpus 0
```

## Environment

DL AMI 27
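
## Possible Cause

The traceback points to an API mismatch rather than a bug in the training script: on MXNet 2.0, `mxnet/optimizer/updater.py` calls `self.optimizer.update_multi_precision([i], [w], [g], [self.states[i]])` with single-element lists, while gluonnlp's `BERTAdam.update_multi_precision` still expects the MXNet 1.x convention of bare arguments and immediately evaluates `weight.dtype`, which fails on a list. The sketch below is a hypothetical compatibility shim, not a verified fix; the module name `bert_adam_compat` and the `_update_multi_precision` wrapper are illustrative, and it assumes only the public signature visible in the traceback:

```python
# bert_adam_compat.py -- hypothetical shim, not a verified gluonnlp patch.
# MXNet 2.0's Updater.__call__ passes single-element lists
# ([index], [weight], [grad], [state]) to update_multi_precision, while
# gluonnlp's BERTAdam expects bare items and reads weight.dtype directly,
# raising: AttributeError: 'list' object has no attribute 'dtype'.
from gluonnlp.optimizer import BERTAdam

_orig_update_multi_precision = BERTAdam.update_multi_precision

def _update_multi_precision(self, index, weight, grad, state):
    """Accept both the 1.x (bare items) and 2.x (lists) calling conventions."""
    if isinstance(weight, (list, tuple)):
        # Unwrap the 2.x list wrappers and update each parameter in turn,
        # delegating to the original single-item implementation.
        for i, w, g, s in zip(index, weight, grad, state):
            _orig_update_multi_precision(self, i, w, g, s)
    else:
        # 1.x convention: pass straight through.
        _orig_update_multi_precision(self, index, weight, grad, state)

BERTAdam.update_multi_precision = _update_multi_precision
```

Until a gluonnlp release is built against the 2.0 optimizer API, pinning `mxnet==1.6.*` (which works, per the description above) is the simpler workaround.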
