barry-jin commented on issue #19420: URL: https://github.com/apache/incubator-mxnet/issues/19420#issuecomment-721890364
## Update

### To Reproduce

This error can be reproduced by running a small set of tests.

```
python3 -m pip install -U --quiet --pre "mxnet-cu102==2.0.0b20201022" -f https://dist.mxnet.io/python
git clone https://github.com/dmlc/gluon-nlp; cd gluon-nlp
git checkout master
python3 -m pip install --quiet -e .[extras]
python3 -m pytest --device='gpu' --verbose --runslow tests/test_models.py tests/test_models_albert.py tests/test_models_bart.py tests/test_models_bert.py
```

<details>
<summary>Error Message</summary>

```
Setting module np/mx/python random seeds, use MXNET_MODULE_SEED=1362441855 to reproduce.
============================== test session starts ===============================
platform linux -- Python 3.6.9, pytest-6.1.2, py-1.9.0, pluggy-0.13.1 -- /usr/bin/python3
cachedir: .pytest_cache
rootdir: /workspace/gluon-nlp, configfile: pytest.ini
plugins: cov-2.10.1
collected 95 items

tests/test_models.py::test_list_backbone_names PASSED [ 1%]
tests/test_models.py::test_get_backbone[ctx0-google_albert_base_v2] PASSED [ 2%]
tests/test_models.py::test_get_backbone[ctx0-google_albert_large_v2] PASSED [ 3%]
tests/test_models.py::test_get_backbone[ctx0-google_albert_xlarge_v2] PASSED [ 4%]
tests/test_models.py::test_get_backbone[ctx0-google_albert_xxlarge_v2] PASSED [ 5%]
tests/test_models.py::test_get_backbone[ctx0-google_en_cased_bert_base] PASSED [ 6%]
tests/test_models.py::test_get_backbone[ctx0-google_en_cased_bert_large] PASSED [ 7%]
tests/test_models.py::test_get_backbone[ctx0-google_en_cased_bert_wwm_large] PASSED [ 8%]
tests/test_models.py::test_get_backbone[ctx0-google_en_uncased_bert_base] PASSED [ 9%]
tests/test_models.py::test_get_backbone[ctx0-google_en_uncased_bert_large] PASSED [ 10%]
tests/test_models.py::test_get_backbone[ctx0-google_en_uncased_bert_wwm_large] PASSED [ 11%]
tests/test_models.py::test_get_backbone[ctx0-google_multi_cased_bert_base] PASSED [ 12%]
tests/test_models.py::test_get_backbone[ctx0-google_zh_bert_base] PASSED [ 13%]
tests/test_models.py::test_get_backbone[ctx0-gluon_electra_small_owt] PASSED [ 14%]
tests/test_models.py::test_get_backbone[ctx0-google_electra_base] PASSED [ 15%]
tests/test_models.py::test_get_backbone[ctx0-google_electra_large] PASSED [ 16%]
tests/test_models.py::test_get_backbone[ctx0-google_electra_small] PASSED [ 17%]
tests/test_models.py::test_get_backbone[ctx0-gpt2_124M] PASSED [ 18%]
tests/test_models.py::test_get_backbone[ctx0-gpt2_1558M] PASSED [ 20%]
tests/test_models.py::test_get_backbone[ctx0-gpt2_355M] PASSED [ 21%]
tests/test_models.py::test_get_backbone[ctx0-gpt2_774M] PASSED [ 22%]
tests/test_models.py::test_get_backbone[ctx0-google_uncased_mobilebert] PASSED [ 23%]
tests/test_models.py::test_get_backbone[ctx0-fairseq_roberta_base] PASSED [ 24%]
tests/test_models.py::test_get_backbone[ctx0-fairseq_roberta_large] PASSED [ 25%]
tests/test_models.py::test_get_backbone[ctx0-fairseq_xlmr_base] PASSED [ 26%]
tests/test_models.py::test_get_backbone[ctx0-fairseq_xlmr_large] PASSED [ 27%]
tests/test_models.py::test_get_backbone[ctx0-fairseq_bart_base] PASSED [ 28%]
tests/test_models.py::test_get_backbone[ctx0-fairseq_bart_large] PASSED [ 29%]
tests/test_models.py::test_tvm_integration[ctx0-NT-2-4-google_albert_base_v2] PASSED [ 30%]
tests/test_models.py::test_tvm_integration[ctx0-NT-2-4-google_en_cased_bert_base] PASSED [ 31%]
tests/test_models.py::test_tvm_integration[ctx0-NT-2-4-google_electra_small] PASSED [ 32%]
tests/test_models.py::test_tvm_integration[ctx0-NT-2-4-fairseq_bart_base] PASSED [ 33%]
tests/test_models.py::test_tvm_integration[ctx0-NT-1-4-google_albert_base_v2] PASSED [ 34%]
tests/test_models.py::test_tvm_integration[ctx0-NT-1-4-google_en_cased_bert_base] PASSED [ 35%]
tests/test_models.py::test_tvm_integration[ctx0-NT-1-4-google_electra_small] PASSED [ 36%]
tests/test_models.py::test_tvm_integration[ctx0-NT-1-4-fairseq_bart_base] PASSED [ 37%]
tests/test_models.py::test_tvm_integration[ctx0-TN-2-4-google_albert_base_v2] PASSED [ 38%]
tests/test_models.py::test_tvm_integration[ctx0-TN-2-4-google_en_cased_bert_base] PASSED [ 40%]
tests/test_models.py::test_tvm_integration[ctx0-TN-2-4-google_electra_small] PASSED [ 41%]
tests/test_models.py::test_tvm_integration[ctx0-TN-2-4-fairseq_bart_base] PASSED [ 42%]
tests/test_models.py::test_tvm_integration[ctx0-TN-1-4-google_albert_base_v2] PASSED [ 43%]
tests/test_models.py::test_tvm_integration[ctx0-TN-1-4-google_en_cased_bert_base] PASSED [ 44%]
tests/test_models.py::test_tvm_integration[ctx0-TN-1-4-google_electra_small] PASSED [ 45%]
tests/test_models.py::test_tvm_integration[ctx0-TN-1-4-fairseq_bart_base] PASSED [ 46%]
tests/test_models_albert.py::test_albert_backbone[auto-False-False] PASSED [ 47%]
tests/test_models_albert.py::test_albert_backbone[auto-True-True] PASSED [ 48%]
tests/test_models_albert.py::test_albert_backbone[NT-False-False] PASSED [ 49%]
tests/test_models_albert.py::test_albert_backbone[NT-True-True] PASSED [ 50%]
tests/test_models_albert.py::test_albert_backbone[TN-False-False] PASSED [ 51%]
tests/test_models_albert.py::test_albert_backbone[TN-True-True] PASSED [ 52%]
tests/test_models_albert.py::test_albert_for_mlm_model[auto] PASSED [ 53%]
tests/test_models_albert.py::test_albert_for_mlm_model[NT] PASSED [ 54%]
tests/test_models_albert.py::test_albert_for_mlm_model[TN] PASSED [ 55%]
tests/test_models_albert.py::test_albert_for_pretrain_model[auto] PASSED [ 56%]
tests/test_models_albert.py::test_albert_for_pretrain_model[NT] PASSED [ 57%]
tests/test_models_albert.py::test_albert_for_pretrain_model[TN] PASSED [ 58%]
tests/test_models_albert.py::test_list_pretrained_albert PASSED [ 60%]
tests/test_models_albert.py::test_albert_get_pretrained[google_albert_base_v2] PASSED [ 61%]
tests/test_models_albert.py::test_albert_get_pretrained[google_albert_large_v2] PASSED [ 62%]
tests/test_models_albert.py::test_albert_get_pretrained[google_albert_xlarge_v2] PASSED [ 63%]
tests/test_models_albert.py::test_albert_get_pretrained[google_albert_xxlarge_v2] PASSED [ 64%]
tests/test_models_bart.py::test_list_pretrained_bart PASSED [ 65%]
tests/test_models_bart.py::test_bart[fairseq_bart_base] PASSED [ 66%]
tests/test_models_bart.py::test_bart[fairseq_bart_large] PASSED [ 67%]
tests/test_models_bart.py::test_bart_cfg_registry PASSED [ 68%]
tests/test_models_bart.py::test_bart_cfg[bart_base] PASSED [ 69%]
tests/test_models_bart.py::test_bart_cfg[bart_large] PASSED [ 70%]
tests/test_models_bert.py::test_list_pretrained_bert PASSED [ 71%]
tests/test_models_bert.py::test_bert_small_cfg[ctx0-auto] PASSED [ 72%]
tests/test_models_bert.py::test_bert_small_cfg[ctx0-NT] PASSED [ 73%]
tests/test_models_bert.py::test_bert_small_cfg[ctx0-TN] PASSED [ 74%]
tests/test_models_bert.py::test_bert_get_pretrained[ctx0-google_en_cased_bert_base] PASSED [ 75%]
tests/test_models_bert.py::test_bert_get_pretrained[ctx0-google_en_cased_bert_large] PASSED [ 76%]
tests/test_models_bert.py::test_bert_get_pretrained[ctx0-google_en_cased_bert_wwm_large] PASSED [ 77%]
tests/test_models_bert.py::test_bert_get_pretrained[ctx0-google_en_uncased_bert_base] PASSED [ 78%]
tests/test_models_bert.py::test_bert_get_pretrained[ctx0-google_en_uncased_bert_large] PASSED [ 80%]
tests/test_models_bert.py::test_bert_get_pretrained[ctx0-google_en_uncased_bert_wwm_large] PASSED [ 81%]
tests/test_models_bert.py::test_bert_get_pretrained[ctx0-google_multi_cased_bert_base] PASSED [ 82%]
tests/test_models_bert.py::test_bert_get_pretrained[ctx0-google_zh_bert_base] PASSED [ 83%]
tests/test_models_electra.py::test_list_pretrained_electra PASSED [ 84%]
tests/test_models_electra.py::test_electra_model[ctx0-auto] PASSED [ 85%]
tests/test_models_electra.py::test_electra_model[ctx0-NT] PASSED [ 86%]
tests/test_models_electra.py::test_electra_model[ctx0-TN] PASSED [ 87%]
tests/test_models_electra.py::test_electra_get_pretrained[ctx0-gluon_electra_small_owt] PASSED [ 88%]
tests/test_models_electra.py::test_electra_get_pretrained[ctx0-google_electra_base] PASSED [ 89%]
tests/test_models_electra.py::test_electra_get_pretrained[ctx0-google_electra_large] PASSED [ 90%]
tests/test_models_electra.py::test_electra_get_pretrained[ctx0-google_electra_small] PASSED [ 91%]
tests/test_models_gpt2.py::test_list_pretrained_gpt2 PASSED [ 92%]
tests/test_models_gpt2.py::test_gpt2_small_config[ctx0-auto] PASSED [ 93%]
tests/test_models_gpt2.py::test_gpt2_small_config[ctx0-TN] PASSED [ 94%]
tests/test_models_gpt2.py::test_gpt2_small_config[ctx0-NT] PASSED [ 95%]
tests/test_models_gpt2.py::test_gpt2_incremental_states[ctx0] PASSED [ 96%]
tests/test_models_gpt2.py::test_gpt2[ctx0-gpt2_124M] PASSED [ 97%]
tests/test_models_gpt2.py::test_gpt2[ctx0-gpt2_355M] PASSED [ 98%]
tests/test_models_gpt2.py::test_gpt2[ctx0-gpt2_774M] FAILED [100%]

============================================================== FAILURES ==============================================================
_____________________________________________________ test_gpt2[ctx0-gpt2_774M] ______________________________________________________

model_name = 'gpt2_774M', ctx = gpu(0)

    @pytest.mark.slow
    @pytest.mark.remote_required
    @pytest.mark.parametrize('model_name', ['gpt2_124M', 'gpt2_355M', 'gpt2_774M'])
    def test_gpt2(model_name, ctx):
        # test from pretrained
        assert len(list_pretrained_gpt2()) > 0
        with tempfile.TemporaryDirectory() as root, ctx:
            cfg, tokenizer, params_path, lm_params_path =\
                get_pretrained_gpt2(model_name, load_backbone=True, load_lm=True, root=root)
            assert cfg.MODEL.vocab_size == len(tokenizer.vocab)
            # test backbone
            gpt2_model = GPT2Model.from_cfg(cfg)
            gpt2_model.load_parameters(params_path)
            # test lm model
            gpt2_lm_model = GPT2ForLM(cfg)
            gpt2_lm_model.load_parameters(lm_params_path)

            # test forward
            batch_size = 3
            seq_length = 32
            vocab_size = len(tokenizer.vocab)
            input_ids = mx.np.array(
                np.random.randint(
                    2,
                    vocab_size,
                    (batch_size, seq_length)
                ),
                dtype=np.int32,
                ctx=ctx
            )
            logits, _ = gpt2_lm_model(
                input_ids,
                gpt2_lm_model.init_states(batch_size, ctx)
            )
>           mx.npx.waitall()

tests/test_models_gpt2.py:142:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/usr/local/lib/python3.6/dist-packages/mxnet/ndarray/ndarray.py:240: in waitall
    check_call(_LIB.MXNDArrayWaitAll())
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

ret = -1

    def check_call(ret):
        """Check the return value of C API call.

        This function will raise an exception when an error occurs.
        Wrap every API call with this function.

        Parameters
        ----------
        ret : int
            return value from API calls.
        """
        if ret != 0:
>           raise get_last_ffi_error()
E           mxnet.base.MXNetError: Traceback (most recent call last):
E             File "../src/storage/./pooled_storage_manager.h", line 192
E           MXNetError: Memory allocation failed out of memory

/usr/local/lib/python3.6/dist-packages/mxnet/base.py:246: MXNetError
-------------------------------------------------------- Captured stdout call --------------------------------------------------------
Downloading /tmp/tmpbj080s2v/gpt2_774M/gpt2-9dc62091.vocab from https://gluonnlp-numpy-data.s3-accelerate.amazonaws.com/models/gpt2_774M/gpt2-9dc62091.vocab...
Downloading /tmp/tmpbj080s2v/gpt2_774M/gpt2-396d4d8e.merges from https://gluonnlp-numpy-data.s3-accelerate.amazonaws.com/models/gpt2_774M/gpt2-396d4d8e.merges...
Downloading /tmp/tmpbj080s2v/gpt2_774M/model-9917e24e.params from https://gluonnlp-numpy-data.s3-accelerate.amazonaws.com/models/gpt2_774M/model-9917e24e.params...
Downloading /tmp/tmpbj080s2v/gpt2_774M/model_lm-cfbfa641.params from https://gluonnlp-numpy-data.s3-accelerate.amazonaws.com/models/gpt2_774M/model_lm-cfbfa641.params...
-------------------------------------------------------- Captured stderr call --------------------------------------------------------
100%|██████████| 558k/558k [00:00<00:00, 7.15MiB/s]
100%|██████████| 456k/456k [00:00<00:00, 6.39MiB/s]
100%|██████████| 3.10G/3.10G [01:16<00:00, 40.5MiB/s]
100%|██████████| 3.10G/3.10G [01:20<00:00, 38.6MiB/s]
========================================================== warnings summary ==========================================================
src/gluonnlp/attention_cell.py:715
  /workspace/gluon-nlp/src/gluonnlp/attention_cell.py:715: DeprecationWarning: invalid escape sequence \s
    """

src/gluonnlp/op.py:226
  /workspace/gluon-nlp/src/gluonnlp/op.py:226: DeprecationWarning: invalid escape sequence \p
    """

tests/test_models_albert.py: 6 warnings
tests/test_models_bart.py: 2 warnings
tests/test_models_bert.py: 3 warnings
tests/test_models_gpt2.py: 3 warnings
  /usr/local/lib/python3.6/dist-packages/mxnet/gluon/block.py:572: UserWarning: Parameter 'weight' is already initialized, ignoring. Set force_reinit=True to re-initialize.
    v.initialize(None, ctx, init, force_reinit=force_reinit)

-- Docs: https://docs.pytest.org/en/stable/warnings.html
====================================================== short test summary info =======================================================
FAILED tests/test_models_gpt2.py::test_gpt2[ctx0-gpt2_774M] - mxnet.base.MXNetError: Traceback (most recent call last):
======================================= 1 failed, 94 passed, 16 warnings in 1990.67s (0:33:10) =======================================
```

</details>

### Possible memory leak

There is a possible GPU memory leak when running `test_models.py::test_tvm_integration` on the 10.22 nightly release (`2.0.0b20201022`).
```
python3 -m pytest --device='gpu' --verbose --runslow tests/test_models.py::test_tvm_integration
```

<img width="1792" alt="Screen Shot 2020-11-04 at 9 39 48 AM" src="https://user-images.githubusercontent.com/69359374/98151677-6b9f0780-1e85-11eb-8b6e-5a3cdc4c3235.png">
<img width="1792" alt="Screen Shot 2020-11-04 at 9 44 52 AM" src="https://user-images.githubusercontent.com/69359374/98151629-50cc9300-1e85-11eb-8e3e-126501e72832.png">
<img width="1790" alt="Screen Shot 2020-11-04 at 9 40 09 AM" src="https://user-images.githubusercontent.com/69359374/98151657-62159f80-1e85-11eb-8ec3-aa58884cbfb4.png">
<img width="1792" alt="Screen Shot 2020-11-04 at 9 45 13 AM" src="https://user-images.githubusercontent.com/69359374/98151688-6fcb2500-1e85-11eb-91c6-827374985bcf.png">
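
To check the suspected leak numerically rather than from screenshots, one option is to sample GPU memory between test cases. The sketch below is only an illustration and is not part of gluon-nlp or MXNet: the helper names `parse_memory_used`, `query_gpu_memory_used`, and `grows_monotonically` are hypothetical, and it assumes `nvidia-smi` is on the `PATH`. MXNet serves GPU allocations from a memory pool (the error above is raised in `pooled_storage_manager.h`), so growth reported by `nvidia-smi` may reflect pooled memory that is still reusable and is a hint, not proof, of a leak.

```python
import subprocess
from typing import List


def parse_memory_used(smi_output: str) -> List[int]:
    """Parse used memory in MiB, one GPU per line, from nvidia-smi CSV output."""
    return [int(line.strip()) for line in smi_output.splitlines() if line.strip()]


def query_gpu_memory_used() -> List[int]:
    """Return used memory in MiB for every visible GPU (requires nvidia-smi)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        check=True,
        stdout=subprocess.PIPE,
        universal_newlines=True,  # text mode; works on Python 3.6 as in the log
    )
    return parse_memory_used(result.stdout)


def grows_monotonically(samples: List[int], min_growth_mib: int = 1) -> bool:
    """True if each sample exceeds the previous one by at least min_growth_mib."""
    if len(samples) < 2:
        return False
    return all(b - a >= min_growth_mib for a, b in zip(samples, samples[1:]))
```

Calling `query_gpu_memory_used()[0]` after each parametrized `test_tvm_integration` case and passing the collected values to `grows_monotonically` would flag the steadily increasing usage described above.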
