[GitHub] [incubator-mxnet] matteosal opened a new issue, #21059: Static graph allocation in cached op consumes much more memory without giving other benefits?

GitBox Tue, 07 Jun 2022 06:37:51 -0700


matteosal opened a new issue, #21059:
URL: https://github.com/apache/incubator-mxnet/issues/21059


   I bumped into this while investigating memory efficiency of v2.0 vs v1.6. I 
am building master branch (v2.0) from commit 
https://github.com/apache/incubator-mxnet/commit/fabcd145cd496628791f9f2ea813048360ac33ca
 and 1.x branch (v1.6) from commit 
https://github.com/apache/incubator-mxnet/commit/6eec9da55c5096079355d1f1a5fa58dcf35d6752
   
   This script loads a BERT implementation (json files attached) and runs it 
multiple times while increasing the input sequence length:
   ```
   import mxnet as mx
   from mxnet import autograd
   import json
   import os, psutil
   import time
   
   symbol_path = '/home/matteo/bert_sym.json'
   shapes_path = '/home/matteo/bert_shapes.json'
   
   def get_memory_usage():
        return psutil.Process(os.getpid()).memory_info().rss / 1e+6
   
   sym = mx.symbol.load(symbol_path)
   with open(shapes_path) as shapes_file:
       shapes = json.load(shapes_file)
   
   version = mx.__version__
   print('version: ' + version)
   
   if(version == '2.0.0'):
        inputs = [mx.nd.ones(shape) for shape in shapes.values()]
        static_alloc = False
        cached_op = mx.ndarray.CachedOp(sym, flags=[('static_alloc', 
static_alloc)])
        print('static_alloc: ' + str(static_alloc))
   else:
        arrays = {name:mx.nd.ones(shape) for (name, shape) in shapes.items()}
        ex = sym.bind(mx.cpu(), arrays, grad_req='null')
   
   batch_size = 32
   seq_lengths = [5, 6, 7, 8, 10, 12, 14, 16, 20, 24, 28, 32, 40, 48, 56, 64, 
88]
   
   print('initial memory: ' + str(get_memory_usage()))
   print()
   for len in seq_lengths:
        input = mx.nd.ones([batch_size, len, 2])
        start = time.time()
        if(version == '2.0.0'):
                inputs[0] = input
                output = cached_op(*inputs, default_ctx=mx.cpu())
        else:
                arrays['.Inputs.Input'] = input
                ex = sym.bind(mx.cpu(), arrays, grad_req='null', shared_exec=ex)
                ex.forward()
                output = ex.outputs[0]
        mx.ndarray.waitall()
        end = time.time()
        print('input length: ' + str(len) + ', output shape: ' + 
str(output.shape) + ', memory: ' + str(get_memory_usage()) + ', time: ' + 
str(end - start))
   ```
   
[bert_model.zip](https://github.com/apache/incubator-mxnet/files/8853467/bert_model.zip)
   
   I have ran this script with both v2.0 and v1.6. For v2.0 you can change the 
line `static_alloc = False` to explore the effects of static allocation. This 
is what I get:
   ```
   version: 1.6.0
   initial memory: 513.232896
   
   input length: 5, output shape: (160, 1, 768), memory: 627.634176, time: 
0.17993521690368652
   input length: 6, output shape: (192, 1, 768), memory: 634.53184, time: 
0.16498708724975586
   input length: 7, output shape: (224, 1, 768), memory: 636.997632, time: 
0.1961820125579834
   input length: 8, output shape: (256, 1, 768), memory: 639.373312, time: 
0.3744235038757324
   input length: 10, output shape: (320, 1, 768), memory: 646.094848, time: 
0.6234118938446045
   input length: 12, output shape: (384, 1, 768), memory: 647.86432, time: 
0.7562270164489746
   input length: 14, output shape: (448, 1, 768), memory: 648.101888, time: 
0.817521333694458
   input length: 16, output shape: (512, 1, 768), memory: 650.092544, time: 
0.546708345413208
   input length: 20, output shape: (640, 1, 768), memory: 653.774848, time: 
0.8869819641113281
   input length: 24, output shape: (768, 1, 768), memory: 663.30624, time: 
1.1742339134216309
   input length: 28, output shape: (896, 1, 768), memory: 675.590144, time: 
1.5763235092163086
   input length: 32, output shape: (1024, 1, 768), memory: 679.415808, time: 
1.2759253978729248
   input length: 40, output shape: (1280, 1, 768), memory: 685.662208, time: 
1.3794775009155273
   input length: 48, output shape: (1536, 1, 768), memory: 691.77344, time: 
1.8214092254638672
   input length: 56, output shape: (1792, 1, 768), memory: 700.432384, time: 
2.0428481101989746
   input length: 64, output shape: (2048, 1, 768), memory: 706.981888, time: 
2.3951635360717773
   input length: 88, output shape: (2816, 1, 768), memory: 742.8096, time: 
3.1772890090942383
   ```
   ```
   version: 2.0.0
   static_alloc: False
   initial memory: 450.53952
   
   input length: 5, output shape: (160, 1, 768), memory: 646.582272, time: 
0.11594986915588379
   input length: 6, output shape: (192, 1, 768), memory: 655.331328, time: 
0.11730408668518066
   input length: 7, output shape: (224, 1, 768), memory: 660.733952, time: 
0.12042427062988281
   input length: 8, output shape: (256, 1, 768), memory: 666.7264, time: 
0.13665461540222168
   input length: 10, output shape: (320, 1, 768), memory: 673.849344, time: 
0.16487836837768555
   input length: 12, output shape: (384, 1, 768), memory: 688.164864, time: 
0.20292162895202637
   input length: 14, output shape: (448, 1, 768), memory: 694.419456, time: 
0.22304844856262207
   input length: 16, output shape: (512, 1, 768), memory: 702.029824, time: 
0.26688313484191895
   input length: 20, output shape: (640, 1, 768), memory: 722.939904, time: 
0.31215453147888184
   input length: 24, output shape: (768, 1, 768), memory: 746.889216, time: 
0.43781352043151855
   input length: 28, output shape: (896, 1, 768), memory: 768.385024, time: 
0.7122831344604492
   input length: 32, output shape: (1024, 1, 768), memory: 772.091904, time: 
0.8170092105865479
   input length: 40, output shape: (1280, 1, 768), memory: 806.465536, time: 
0.9898905754089355
   input length: 48, output shape: (1536, 1, 768), memory: 853.385216, time: 
1.1750028133392334
   input length: 56, output shape: (1792, 1, 768), memory: 901.287936, time: 
1.3260750770568848
   input length: 64, output shape: (2048, 1, 768), memory: 953.700352, time: 
1.5255098342895508
   input length: 88, output shape: (2816, 1, 768), memory: 1040.379904, time: 
2.304577350616455
   ```
   ```
   version: 2.0.0
   static_alloc: True
   initial memory: 460.10368
   
   input length: 5, output shape: (160, 1, 768), memory: 707.33824, time: 
0.13922739028930664
   input length: 6, output shape: (192, 1, 768), memory: 774.643712, time: 
0.10790896415710449
   input length: 7, output shape: (224, 1, 768), memory: 819.228672, time: 
0.12243270874023438
   input length: 8, output shape: (256, 1, 768), memory: 831.901696, time: 
0.13582420349121094
   input length: 10, output shape: (320, 1, 768), memory: 853.38112, time: 
0.17856955528259277
   input length: 12, output shape: (384, 1, 768), memory: 967.712768, time: 
0.20433759689331055
   input length: 14, output shape: (448, 1, 768), memory: 987.136, time: 
0.22087836265563965
   input length: 16, output shape: (512, 1, 768), memory: 1007.665152, time: 
0.24690675735473633
   input length: 20, output shape: (640, 1, 768), memory: 1199.075328, time: 
0.3204777240753174
   input length: 24, output shape: (768, 1, 768), memory: 1425.625088, time: 
0.3830389976501465
   input length: 28, output shape: (896, 1, 768), memory: 1689.489408, time: 
0.4304239749908447
   input length: 32, output shape: (1024, 1, 768), memory: 1726.783488, time: 
0.515824556350708
   input length: 40, output shape: (1280, 1, 768), memory: 2105.565184, time: 
0.9770240783691406
   input length: 48, output shape: (1536, 1, 768), memory: 2558.291968, time: 
1.218252420425415
   input length: 56, output shape: (1792, 1, 768), memory: 3089.73568, time: 
1.4080862998962402
   input length: 64, output shape: (2048, 1, 768), memory: 3693.993984, time: 
1.6184279918670654
   input length: 88, output shape: (2816, 1, 768), memory: 4524.1344, time: 
2.394423246383667
   ```
   Besides being glad that v2.0 has faster timings, there are 2 things I don't 
understand here:
   1) Static allocation uses roughly 4 times more memory while not being any 
faster. Is this expected for such a model?
   2) In terms of memory usage, static allocation in `CachedOp` is supposed to 
be closer to the scenario of old executors in v1.6 because those are also 
supposed to statically allocate their computational graphs. But yet v1.6 uses 
way less memory than v2.0 + static allocation, and also less memory than 2.0 
without static allocation. What's happening here?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@mxnet.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@mxnet.apache.org
For additional commands, e-mail: issues-h...@mxnet.apache.org

[GitHub] [incubator-mxnet] matteosal opened a new issue, #21059: Static graph allocation in cached op consumes much more memory without giving other benefits?

Reply via email to