szha commented on issue #14973: [MXNET-1404] Added the GPU memory profiler
URL: https://github.com/apache/incubator-mxnet/pull/14973#issuecomment-495969814
 
 
   We had a discussion yesterday about the design (w/ @ArmageddonKnight 
@anirudh2290 @eric-haibin-lin). We all agreed that this is an awesome feature 
to add to mxnet. There were some concerns about the current design; one point 
in particular was the requirement to pass a name from the frontend for each 
NDArray, along with the addition of a named imperative invoke interface. The 
motivation for that was to be able to identify and track the source of an 
allocation down to specific code.
   
   As agreed, here I offer an alternative design (and a mock-up of the user 
experience) that names NDArrays automatically so they are easy to identify, 
without requiring manual naming of every NDArray. This can be done by 
providing interfaces that make it easy for users to mark a region of code. 
That, in combination with the information from the operators, should allow 
users to easily identify the part of the code responsible for a given memory 
allocation.
   
   The design enables:
   - user-specified scopes in code.
   - identifying arrays within those scopes.
   - no changes to existing execution interfaces such as imperative invoke.
   
   Suppose a user wants to profile a function that looks like this:
   ```
   def function1(nd1, nd2):
       r1 = mx.nd.op1(nd1)
       r2 = mx.nd.op1(nd2)
       r3 = mx.nd.op2(r1, r2)
       return r3
   ```
   
   One easy way is to have an interface to the users for marking the scope.
   ```
   def function1(nd1, nd2):
       with mx.profiler.scope('function1'):
           r1 = mx.nd.op1(nd1)
           r2 = mx.nd.op1(nd2)
           r3 = mx.nd.op2(r1, r2)
           return r3
   ```
   
   When entering the profiler scope, it can invoke a new C API that sets the 
thread-local profiler scope name by prepending the outer scope name (if any) 
to the user-specified scope name, and saves the old scope. When exiting, it 
restores the old scope. When an allocation happens, the entry should be named 
according to the current scope name, with a per-op counter to differentiate 
multiple invocations of the same op (i.e. the allocation identifier should 
look like `{scope}.{op}.{counter}`). For the code example above, this should 
produce the following allocation records:
   ```
   name,bytes
   function1.op1.1,x
   function1.op1.2,y
   function1.op2.1,z
   ```
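As a rough illustration (not part of the PR; names like `current_scope` are hypothetical), the Python side of the scope context manager could maintain the thread-local name like this, with the new C API call left as a placeholder:

```python
import threading

# Thread-local storage for the active profiler scope name; a real
# implementation would also forward the name to the backend through
# the new C API described above.
_local = threading.local()

def current_scope():
    """Return the active scope name, or '' outside any scope."""
    return getattr(_local, 'scope', '')

class scope:
    """Prepends the outer scope name (if any) to the given name on
    entry, and restores the previous scope on exit."""
    def __init__(self, name):
        self.name = name

    def __enter__(self):
        self._old = current_scope()
        _local.scope = self._old + '.' + self.name if self._old else self.name
        # Placeholder: invoke the new C API here to set the backend's
        # thread-local scope name.
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Placeholder: restore the backend's scope via the C API too.
        _local.scope = self._old
        return False
```

With this, entering `scope('sec1')` inside `scope('function1')` composes the name to `function1.sec1`, and exiting restores `function1`.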
   
   If the user wants finer granularity within the scope of the code, nesting 
is possible:
   ```
   def function1(nd1, nd2):
       with mx.profiler.scope('function1'):
           r1 = mx.nd.op1(nd1)
           r2 = mx.nd.op1(nd2)
           with mx.profiler.scope('sec1'):
               r3 = mx.nd.op2(r1, r2)
           return r3
   ```
   which results in the following records:
   ```
   name,bytes
   function1.op1.1,x
   function1.op1.2,y
   function1.sec1.op2.1,z
   ```
   
   For easier usage, we can also utilize the decorator syntax:
   ```
   @mx.profiler.scope('function1')
   def function1(nd1, nd2):
       r1 = mx.nd.op1(nd1)
       r2 = mx.nd.op1(nd2)
       r3 = mx.nd.op2(r1, r2)
       return r3
   ```
   The decorator should wrap the decorated code within the aforementioned scope.
   
   For Gluon, we can update the `__call__` function in `Block` and 
`HybridBlock` to automatically use the class names as scopes. For optimizers, 
we can update the `update` function to achieve a similar effect.
   
   Also, notice that the allocation identifier already carries a hierarchy 
that can be used for aggregation. This can be used to aggregate results 
without requiring a mapping from the users, which means there would no longer 
be a need to ask for `SETME.py` changes.
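To sketch what prefix-based aggregation over the `{scope}.{op}.{counter}` identifiers could look like (byte counts below are made up for illustration; only the `name,bytes` CSV shape comes from the design above):

```python
import csv
import io
from collections import defaultdict

# Example allocation records in the {scope}.{op}.{counter} form.
records = """name,bytes
function1.op1.1,4096
function1.op1.2,4096
function1.sec1.op2.1,8192
"""

def aggregate(csv_text):
    """Sum bytes under every dotted prefix of the identifier (excluding
    the trailing counter), so each scope and op level reports the total
    it encloses."""
    totals = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        parts = row['name'].split('.')
        nbytes = int(row['bytes'])
        for i in range(1, len(parts)):
            totals['.'.join(parts[:i])] += nbytes
    return dict(totals)
```

For the records above, `aggregate(records)['function1']` gives the total across all nested scopes and ops (16384 bytes), while `function1.sec1` reports only its own 8192, with no user-supplied mapping needed.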
   
   This design can be applied to both Symbol and NDArray. The allocation 
identifier should be treated separately from the variable names in Symbol, so 
that it provides a consistent experience for Gluon HybridBlocks.
