[GitHub] [incubator-mxnet] barry-jin commented on issue #19420: [BUG] Fatal Python error when running GluonNLP pytest on MXNet linux nightly build

GitBox Wed, 04 Nov 2020 10:11:11 -0800


barry-jin commented on issue #19420:
URL: https://github.com/apache/incubator-mxnet/issues/19420#issuecomment-721890364



   ## Update 
   ### To Reproduce 
   The error can be reproduced by running a small set of tests. 
   ```
   python3 -m pip install -U --quiet --pre "mxnet-cu102==2.0.0b20201022" -f https://dist.mxnet.io/python
   git clone https://github.com/dmlc/gluon-nlp; cd gluon-nlp
   git checkout master
   python3 -m pip install --quiet -e .[extras]
   python3 -m pytest --device='gpu' --verbose --runslow tests/test_models.py tests/test_models_albert.py tests/test_models_bart.py tests/test_models_bert.py
   ```
   <details>
   <summary>Error Message</summary>
   
   ```
   Setting module np/mx/python random seeds, use MXNET_MODULE_SEED=1362441855 to reproduce.
   ============================== test session starts ===============================
   platform linux -- Python 3.6.9, pytest-6.1.2, py-1.9.0, pluggy-0.13.1 -- /usr/bin/python3
   cachedir: .pytest_cache
   rootdir: /workspace/gluon-nlp, configfile: pytest.ini
   plugins: cov-2.10.1
   collected 95 items
   
   tests/test_models.py::test_list_backbone_names PASSED [  1%]
   tests/test_models.py::test_get_backbone[ctx0-google_albert_base_v2] PASSED [  2%]
   tests/test_models.py::test_get_backbone[ctx0-google_albert_large_v2] PASSED [  3%]
   tests/test_models.py::test_get_backbone[ctx0-google_albert_xlarge_v2] PASSED [  4%]
   tests/test_models.py::test_get_backbone[ctx0-google_albert_xxlarge_v2] PASSED [  5%]
   tests/test_models.py::test_get_backbone[ctx0-google_en_cased_bert_base] PASSED [  6%]
   tests/test_models.py::test_get_backbone[ctx0-google_en_cased_bert_large] PASSED [  7%]
   tests/test_models.py::test_get_backbone[ctx0-google_en_cased_bert_wwm_large] PASSED [  8%]
   tests/test_models.py::test_get_backbone[ctx0-google_en_uncased_bert_base] PASSED [  9%]
   tests/test_models.py::test_get_backbone[ctx0-google_en_uncased_bert_large] PASSED [ 10%]
   tests/test_models.py::test_get_backbone[ctx0-google_en_uncased_bert_wwm_large] PASSED [ 11%]
   tests/test_models.py::test_get_backbone[ctx0-google_multi_cased_bert_base] PASSED [ 12%]
   tests/test_models.py::test_get_backbone[ctx0-google_zh_bert_base] PASSED [ 13%]
   tests/test_models.py::test_get_backbone[ctx0-gluon_electra_small_owt] PASSED [ 14%]
   tests/test_models.py::test_get_backbone[ctx0-google_electra_base] PASSED [ 15%]
   tests/test_models.py::test_get_backbone[ctx0-google_electra_large] PASSED [ 16%]
   tests/test_models.py::test_get_backbone[ctx0-google_electra_small] PASSED [ 17%]
   tests/test_models.py::test_get_backbone[ctx0-gpt2_124M] PASSED [ 18%]
   tests/test_models.py::test_get_backbone[ctx0-gpt2_1558M] PASSED [ 20%]
   tests/test_models.py::test_get_backbone[ctx0-gpt2_355M] PASSED [ 21%]
   tests/test_models.py::test_get_backbone[ctx0-gpt2_774M] PASSED [ 22%]
   tests/test_models.py::test_get_backbone[ctx0-google_uncased_mobilebert] PASSED [ 23%]
   tests/test_models.py::test_get_backbone[ctx0-fairseq_roberta_base] PASSED [ 24%]
   tests/test_models.py::test_get_backbone[ctx0-fairseq_roberta_large] PASSED [ 25%]
   tests/test_models.py::test_get_backbone[ctx0-fairseq_xlmr_base] PASSED [ 26%]
   tests/test_models.py::test_get_backbone[ctx0-fairseq_xlmr_large] PASSED [ 27%]
   tests/test_models.py::test_get_backbone[ctx0-fairseq_bart_base] PASSED [ 28%]
   tests/test_models.py::test_get_backbone[ctx0-fairseq_bart_large] PASSED [ 29%]
   tests/test_models.py::test_tvm_integration[ctx0-NT-2-4-google_albert_base_v2] PASSED [ 30%]
   tests/test_models.py::test_tvm_integration[ctx0-NT-2-4-google_en_cased_bert_base] PASSED [ 31%]
   tests/test_models.py::test_tvm_integration[ctx0-NT-2-4-google_electra_small] PASSED [ 32%]
   tests/test_models.py::test_tvm_integration[ctx0-NT-2-4-fairseq_bart_base] PASSED [ 33%]
   tests/test_models.py::test_tvm_integration[ctx0-NT-1-4-google_albert_base_v2] PASSED [ 34%]
   tests/test_models.py::test_tvm_integration[ctx0-NT-1-4-google_en_cased_bert_base] PASSED [ 35%]
   tests/test_models.py::test_tvm_integration[ctx0-NT-1-4-google_electra_small] PASSED [ 36%]
   tests/test_models.py::test_tvm_integration[ctx0-NT-1-4-fairseq_bart_base] PASSED [ 37%]
   tests/test_models.py::test_tvm_integration[ctx0-TN-2-4-google_albert_base_v2] PASSED [ 38%]
   tests/test_models.py::test_tvm_integration[ctx0-TN-2-4-google_en_cased_bert_base] PASSED [ 40%]
   tests/test_models.py::test_tvm_integration[ctx0-TN-2-4-google_electra_small] PASSED [ 41%]
   tests/test_models.py::test_tvm_integration[ctx0-TN-2-4-fairseq_bart_base] PASSED [ 42%]
   tests/test_models.py::test_tvm_integration[ctx0-TN-1-4-google_albert_base_v2] PASSED [ 43%]
   tests/test_models.py::test_tvm_integration[ctx0-TN-1-4-google_en_cased_bert_base] PASSED [ 44%]
   tests/test_models.py::test_tvm_integration[ctx0-TN-1-4-google_electra_small] PASSED [ 45%]
   tests/test_models.py::test_tvm_integration[ctx0-TN-1-4-fairseq_bart_base] PASSED [ 46%]
   tests/test_models_albert.py::test_albert_backbone[auto-False-False] PASSED [ 47%]
   tests/test_models_albert.py::test_albert_backbone[auto-True-True] PASSED [ 48%]
   tests/test_models_albert.py::test_albert_backbone[NT-False-False] PASSED [ 49%]
   tests/test_models_albert.py::test_albert_backbone[NT-True-True] PASSED [ 50%]
   tests/test_models_albert.py::test_albert_backbone[TN-False-False] PASSED [ 51%]
   tests/test_models_albert.py::test_albert_backbone[TN-True-True] PASSED [ 52%]
   tests/test_models_albert.py::test_albert_for_mlm_model[auto] PASSED [ 53%]
   tests/test_models_albert.py::test_albert_for_mlm_model[NT] PASSED [ 54%]
   tests/test_models_albert.py::test_albert_for_mlm_model[TN] PASSED [ 55%]
   tests/test_models_albert.py::test_albert_for_pretrain_model[auto] PASSED [ 56%]
   tests/test_models_albert.py::test_albert_for_pretrain_model[NT] PASSED [ 57%]
   tests/test_models_albert.py::test_albert_for_pretrain_model[TN] PASSED [ 58%]
   tests/test_models_albert.py::test_list_pretrained_albert PASSED [ 60%]
   tests/test_models_albert.py::test_albert_get_pretrained[google_albert_base_v2] PASSED [ 61%]
   tests/test_models_albert.py::test_albert_get_pretrained[google_albert_large_v2] PASSED [ 62%]
   tests/test_models_albert.py::test_albert_get_pretrained[google_albert_xlarge_v2] PASSED [ 63%]
   tests/test_models_albert.py::test_albert_get_pretrained[google_albert_xxlarge_v2] PASSED [ 64%]
   tests/test_models_bart.py::test_list_pretrained_bart PASSED [ 65%]
   tests/test_models_bart.py::test_bart[fairseq_bart_base] PASSED [ 66%]
   tests/test_models_bart.py::test_bart[fairseq_bart_large] PASSED [ 67%]
   tests/test_models_bart.py::test_bart_cfg_registry PASSED [ 68%]
   tests/test_models_bart.py::test_bart_cfg[bart_base] PASSED [ 69%]
   tests/test_models_bart.py::test_bart_cfg[bart_large] PASSED [ 70%]
   tests/test_models_bert.py::test_list_pretrained_bert PASSED [ 71%]
   tests/test_models_bert.py::test_bert_small_cfg[ctx0-auto] PASSED [ 72%]
   tests/test_models_bert.py::test_bert_small_cfg[ctx0-NT] PASSED [ 73%]
   tests/test_models_bert.py::test_bert_small_cfg[ctx0-TN] PASSED [ 74%]
   tests/test_models_bert.py::test_bert_get_pretrained[ctx0-google_en_cased_bert_base] PASSED [ 75%]
   tests/test_models_bert.py::test_bert_get_pretrained[ctx0-google_en_cased_bert_large] PASSED [ 76%]
   tests/test_models_bert.py::test_bert_get_pretrained[ctx0-google_en_cased_bert_wwm_large] PASSED [ 77%]
   tests/test_models_bert.py::test_bert_get_pretrained[ctx0-google_en_uncased_bert_base] PASSED [ 78%]
   tests/test_models_bert.py::test_bert_get_pretrained[ctx0-google_en_uncased_bert_large] PASSED [ 80%]
   tests/test_models_bert.py::test_bert_get_pretrained[ctx0-google_en_uncased_bert_wwm_large] PASSED [ 81%]
   tests/test_models_bert.py::test_bert_get_pretrained[ctx0-google_multi_cased_bert_base] PASSED [ 82%]
   tests/test_models_bert.py::test_bert_get_pretrained[ctx0-google_zh_bert_base] PASSED [ 83%]
   tests/test_models_electra.py::test_list_pretrained_electra PASSED [ 84%]
   tests/test_models_electra.py::test_electra_model[ctx0-auto] PASSED [ 85%]
   tests/test_models_electra.py::test_electra_model[ctx0-NT] PASSED [ 86%]
   tests/test_models_electra.py::test_electra_model[ctx0-TN] PASSED [ 87%]
   tests/test_models_electra.py::test_electra_get_pretrained[ctx0-gluon_electra_small_owt] PASSED [ 88%]
   tests/test_models_electra.py::test_electra_get_pretrained[ctx0-google_electra_base] PASSED [ 89%]
   tests/test_models_electra.py::test_electra_get_pretrained[ctx0-google_electra_large] PASSED [ 90%]
   tests/test_models_electra.py::test_electra_get_pretrained[ctx0-google_electra_small] PASSED [ 91%]
   tests/test_models_gpt2.py::test_list_pretrained_gpt2 PASSED [ 92%]
   tests/test_models_gpt2.py::test_gpt2_small_config[ctx0-auto] PASSED [ 93%]
   tests/test_models_gpt2.py::test_gpt2_small_config[ctx0-TN] PASSED [ 94%]
   tests/test_models_gpt2.py::test_gpt2_small_config[ctx0-NT] PASSED [ 95%]
   tests/test_models_gpt2.py::test_gpt2_incremental_states[ctx0] PASSED [ 96%]
   tests/test_models_gpt2.py::test_gpt2[ctx0-gpt2_124M] PASSED [ 97%]
   tests/test_models_gpt2.py::test_gpt2[ctx0-gpt2_355M] PASSED [ 98%]
   tests/test_models_gpt2.py::test_gpt2[ctx0-gpt2_774M] FAILED [100%]
   
   =================================== FAILURES ===================================
   __________________________ test_gpt2[ctx0-gpt2_774M] ___________________________
   
   model_name = 'gpt2_774M', ctx = gpu(0)
   
       @pytest.mark.slow
       @pytest.mark.remote_required
       @pytest.mark.parametrize('model_name', ['gpt2_124M', 'gpt2_355M', 'gpt2_774M'])
       def test_gpt2(model_name, ctx):
           # test from pretrained
           assert len(list_pretrained_gpt2()) > 0
           with tempfile.TemporaryDirectory() as root, ctx:
               cfg, tokenizer, params_path, lm_params_path =\
                   get_pretrained_gpt2(model_name, load_backbone=True, load_lm=True, root=root)
               assert cfg.MODEL.vocab_size == len(tokenizer.vocab)
               # test backbone
               gpt2_model = GPT2Model.from_cfg(cfg)
               gpt2_model.load_parameters(params_path)
               # test lm model
               gpt2_lm_model = GPT2ForLM(cfg)
               gpt2_lm_model.load_parameters(lm_params_path)
       
               # test forward
               batch_size = 3
               seq_length = 32
               vocab_size = len(tokenizer.vocab)
               input_ids = mx.np.array(
                   np.random.randint(
                       2,
                       vocab_size,
                       (batch_size, seq_length)
                   ),
                   dtype=np.int32,
                   ctx=ctx
               )
               logits, _ = gpt2_lm_model(
                   input_ids,
                   gpt2_lm_model.init_states(batch_size, ctx)
               )
   >           mx.npx.waitall()
   
   tests/test_models_gpt2.py:142: 
   _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
   /usr/local/lib/python3.6/dist-packages/mxnet/ndarray/ndarray.py:240: in waitall
       check_call(_LIB.MXNDArrayWaitAll())
   _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
   
   ret = -1
   
       def check_call(ret):
           """Check the return value of C API call.
       
           This function will raise an exception when an error occurs.
           Wrap every API call with this function.
       
           Parameters
           ----------
           ret : int
               return value from API calls.
           """
           if ret != 0:
   >           raise get_last_ffi_error()
   E           mxnet.base.MXNetError: Traceback (most recent call last):
   E             File "../src/storage/./pooled_storage_manager.h", line 192
   E           MXNetError: Memory allocation failed out of memory
   
   /usr/local/lib/python3.6/dist-packages/mxnet/base.py:246: MXNetError
   ----------------------------- Captured stdout call -----------------------------
   Downloading /tmp/tmpbj080s2v/gpt2_774M/gpt2-9dc62091.vocab from https://gluonnlp-numpy-data.s3-accelerate.amazonaws.com/models/gpt2_774M/gpt2-9dc62091.vocab...
   Downloading /tmp/tmpbj080s2v/gpt2_774M/gpt2-396d4d8e.merges from https://gluonnlp-numpy-data.s3-accelerate.amazonaws.com/models/gpt2_774M/gpt2-396d4d8e.merges...
   Downloading /tmp/tmpbj080s2v/gpt2_774M/model-9917e24e.params from https://gluonnlp-numpy-data.s3-accelerate.amazonaws.com/models/gpt2_774M/model-9917e24e.params...
   Downloading /tmp/tmpbj080s2v/gpt2_774M/model_lm-cfbfa641.params from https://gluonnlp-numpy-data.s3-accelerate.amazonaws.com/models/gpt2_774M/model_lm-cfbfa641.params...
   ----------------------------- Captured stderr call -----------------------------
   100%|██████████| 558k/558k [00:00<00:00, 7.15MiB/s]
   100%|██████████| 456k/456k [00:00<00:00, 6.39MiB/s]
   100%|██████████| 3.10G/3.10G [01:16<00:00, 40.5MiB/s]
   100%|██████████| 3.10G/3.10G [01:20<00:00, 38.6MiB/s]
   =============================== warnings summary ===============================
   src/gluonnlp/attention_cell.py:715
     /workspace/gluon-nlp/src/gluonnlp/attention_cell.py:715: DeprecationWarning: invalid escape sequence \s
       """
   
   src/gluonnlp/op.py:226
     /workspace/gluon-nlp/src/gluonnlp/op.py:226: DeprecationWarning: invalid escape sequence \p
       """
   
   tests/test_models_albert.py: 6 warnings
   tests/test_models_bart.py: 2 warnings
   tests/test_models_bert.py: 3 warnings
   tests/test_models_gpt2.py: 3 warnings
     /usr/local/lib/python3.6/dist-packages/mxnet/gluon/block.py:572: UserWarning: Parameter 'weight' is already initialized, ignoring. Set force_reinit=True to re-initialize.
       v.initialize(None, ctx, init, force_reinit=force_reinit)
   
   -- Docs: https://docs.pytest.org/en/stable/warnings.html
   =========================== short test summary info ============================
   FAILED tests/test_models_gpt2.py::test_gpt2[ctx0-gpt2_774M] - mxnet.base.MXNetError: Traceback (most recent call last):
   ============ 1 failed, 94 passed, 16 warnings in 1990.67s (0:33:10) ============
   ```
   
   </details>
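
One way to check whether the out-of-memory failure depends on the tests that ran before it is to rerun only the failing test in a fresh process with the seed printed at the top of the log. The sketch below is an illustration, not part of the original report: the helper name `isolated_repro` is hypothetical, it reuses only the pytest flags and the test ID shown above, and it assumes it is started from the root of the `gluon-nlp` checkout.

```python
import os
import subprocess
import sys

# Test ID and seed copied from the log above.
FAILING_TEST = "tests/test_models_gpt2.py::test_gpt2[ctx0-gpt2_774M]"
MODULE_SEED = "1362441855"


def isolated_repro(node_id=FAILING_TEST, seed=MODULE_SEED):
    """Build the command and environment for running one test in isolation.

    Hypothetical helper: it only assembles the same flags used in the
    reproduction command and sets MXNET_MODULE_SEED as the log suggests.
    """
    argv = [
        sys.executable, "-m", "pytest",
        "--device=gpu", "--verbose", "--runslow",
        node_id,
    ]
    env = dict(os.environ, MXNET_MODULE_SEED=seed)
    return argv, env


if __name__ == "__main__":
    argv, env = isolated_repro()
    # A fresh process starts with an empty GPU memory pool.
    sys.exit(subprocess.run(argv, env=env).returncode)
```

If the test passes in isolation but fails at the end of the full run, that would be consistent with GPU memory accumulating across earlier tests rather than with this test alone exceeding the device memory.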
   
   ### Possible memory leak
   There is a possible GPU memory leak when running `test_models.py::test_tvm_integration` on the October 22 nightly build (`2.0.0b20201022`). 
   ```
   python3 -m pytest --device='gpu' --verbose --runslow tests/test_models.py::test_tvm_integration
   ```
   
   <img width="1792" alt="Screen Shot 2020-11-04 at 9 39 48 AM" src="https://user-images.githubusercontent.com/69359374/98151677-6b9f0780-1e85-11eb-8b6e-5a3cdc4c3235.png">
   <img width="1792" alt="Screen Shot 2020-11-04 at 9 44 52 AM" src="https://user-images.githubusercontent.com/69359374/98151629-50cc9300-1e85-11eb-8e3e-126501e72832.png">
   <img width="1790" alt="Screen Shot 2020-11-04 at 9 40 09 AM" src="https://user-images.githubusercontent.com/69359374/98151657-62159f80-1e85-11eb-8ec3-aa58884cbfb4.png">
   <img width="1792" alt="Screen Shot 2020-11-04 at 9 45 13 AM" src="https://user-images.githubusercontent.com/69359374/98151688-6fcb2500-1e85-11eb-91c6-827374985bcf.png">
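
The growth can also be recorded as numbers rather than screenshots by sampling `nvidia-smi` while the test runs in another terminal. This is a minimal sketch under stated assumptions: the function names are hypothetical, the "every sample is larger than the previous one" rule is only a rough heuristic and not a proof of a leak, and only GPU 0 is sampled.

```python
import subprocess
import time

# nvidia-smi prints one line per GPU with the used memory in MiB.
QUERY = ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"]


def parse_memory_used(output):
    """Turn the nvidia-smi output into a list of MiB values, one per GPU."""
    return [int(line.strip()) for line in output.splitlines() if line.strip()]


def grows_monotonically(samples, min_increase_mib=1):
    """Heuristic (an assumption): every sample exceeds the one before it."""
    if len(samples) < 2:
        return False
    return all(b - a >= min_increase_mib for a, b in zip(samples, samples[1:]))


def sample_gpu0(count=10, interval_s=30.0):
    """Collect `count` readings of used memory on GPU 0."""
    samples = []
    for _ in range(count):
        # stdout=PIPE and universal_newlines keep this Python 3.6 compatible.
        result = subprocess.run(
            QUERY, stdout=subprocess.PIPE, universal_newlines=True, check=True
        )
        samples.append(parse_memory_used(result.stdout)[0])
        time.sleep(interval_s)
    return samples


if __name__ == "__main__":
    readings = sample_gpu0()
    print("used MiB:", readings)
    print("steady growth:", grows_monotonically(readings))
```

Running this alongside the `test_tvm_integration` command above gives a series of readings that can be pasted into the issue and compared between nightly builds.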
   

