Zha0q1 opened a new issue #15241: [WIP] [RFC] Custom Operator Profiling Enhancement
URL: https://github.com/apache/incubator-mxnet/issues/15241
 
 
   ### Motivation
   
   MXNet allows users to create custom operators if the existing NDArray operators cannot meet their needs. However, profiling of custom operators is currently not well supported.
   
   To see the issues, we first need to understand how custom operators work. First of all, CustomOperator is a singleton class with its own task queue and worker threads. When a custom operator is executed by the engine, the engine thread calls Push() in CustomOperator. Push() then creates the actual workload and pushes it to the task queue, where one of CustomOperator's own worker threads picks the task up and runs it.
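   
   To make this structure concrete, here is a generic C++ sketch of the pattern just described: a singleton with its own task queue and worker thread. This is only an illustration of the design as described in this paragraph, not the actual CustomOperator code in MXNet, and every name in it is made up.
   
   ```
   // Illustrative sketch of the described pattern; NOT the real CustomOperator.
   #include <condition_variable>
   #include <functional>
   #include <mutex>
   #include <queue>
   #include <thread>
   
   class CustomOperatorSketch {
    public:
     static CustomOperatorSketch* Get() {
       static CustomOperatorSketch inst;
       return &inst;
     }
   
     // Called from an engine thread: wrap the actual workload and enqueue it
     // for the class's own worker thread.
     void Push(std::function<void()> workload) {
       {
         std::lock_guard<std::mutex> lock(mutex_);
         tasks_.push(std::move(workload));
       }
       cv_.notify_one();
     }
   
    private:
     CustomOperatorSketch() {
       // Worker thread detached for brevity; the real class manages lifetimes.
       std::thread([this]() { Loop(); }).detach();
     }
   
     // Worker loop: pick tasks up from the queue and run them. In the real
     // implementation the task ends up calling the Python forward()/backward().
     void Loop() {
       while (true) {
         std::function<void()> task;
         {
           std::unique_lock<std::mutex> lock(mutex_);
           cv_.wait(lock, [this]() { return !tasks_.empty(); });
           task = std::move(tasks_.front());
           tasks_.pop();
         }
         task();
       }
     }
   
     std::mutex mutex_;
     std::condition_variable cv_;
     std::queue<std::function<void()>> tasks_;
   };
   ```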
   
   When users create a new custom operator, they need to define both the forward() and backward() functions in Python. When the operator is executed, one of these two functions is called. These functions consist of two kinds of code: 1) Python and NumPy code (I call this pure Python code), which runs in CustomOperator's own worker thread; and 2) code that calls NDArray operators (I call these sub-operators), which is pushed to the engine and runs asynchronously with respect to CustomOperator's worker threads.
   
   Also, in CustomOperator's Push(), a special callback named “CustomOperator” (now renamed to “Dummy_Wait”; we will use this name below) is pushed to the engine. The idea is that “Dummy_Wait” has dependencies on the custom operator and is executed last, so that the custom operator event spans the execution of both the pure Python code and the sub-operators.
   
   With that said, there are some issues with the current profiler: 
   
   * The start timestamp of custom operator events is correct, but the end timestamp waits for “Dummy_Wait”, which itself waits for a bunch of variable deletions. The end result is that custom operator events appear much longer than the actual computation. Also, because of parallelism, users can never tell whether, at a specific point within a custom operator event, we are executing pure Python code, executing sub-operators, or simply waiting for dependencies to clear.
   * Because custom operator events do not end until “Dummy_Wait” starts, the former do not wrap around the latter. This violates the TraceCompass format and messes up the chrome://tracing timeline. Each event should have a begin and an end timestamp; however, Chrome will treat one such event as two separate ones and report that one has no “BEGIN” and the other has no “END”.
   * All custom operator events are named “Custom”. If users have more than one custom operator, they cannot distinguish between them.
   * Also, we cannot tell which NDArray operators are sub-operators of custom operators (i.e. NDArray operators called within a custom operator's forward() or backward() definition).
   
   To avoid confusion, those issues need to be fixed.
   
   ### New Design 
   
   **Main changes**
   
   Regarding custom operators, users care about the performance of both the pure Python code and the sub-operators. So, in our enhanced custom operator profiling, we should dissect custom operator calls into fine-grained events for both categories. Specifically, we will create a new domain called “Custom Operator”. There, we will have: 1) events that represent the execution of the pure Python code; 2) events that represent the execution of the sub-operators; and 3) for different custom operators, events with different namespace prefixes.
   
   This way, all the events will have accurate begin and end timestamps; there 
will also be no confusion about wait time.
   
   To achieve 1), we will create a new singleton class called CustomOpProfiler in src/profiler. This singleton class has the APIs OnCustomBegin() and OnCustomEnd(). We will call them right before and after the callback to forward() or backward() (as defined in Python) in custom-inl.h to profile the pure Python code. Also, we want to put those events in the new “Custom Operator” domain.
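   
   As a rough illustration, a minimal C++ sketch of such a singleton is given below. The class name and the OnCustomBegin()/OnCustomEnd() APIs come from the design above; the thread-id map, the member names, and the locking are my own assumptions and may well differ from the code in the PR.
   
   ```
   // Illustrative sketch; details may differ from the actual PR.
   #include <mutex>
   #include <string>
   #include <thread>
   #include <unordered_map>
   
   class CustomOpProfiler {
    public:
     static CustomOpProfiler* Get() {
       static CustomOpProfiler inst;
       return &inst;
     }
   
     // Called in custom-inl.h right before the Python forward()/backward()
     // callback runs on a CustomOperator worker thread.
     void OnCustomBegin(const std::string& op_type) {
       std::lock_guard<std::mutex> lock(mutex_);
       // Remember which custom operator this worker thread is executing so
       // that sub-operators pushed from it can later be attributed to it.
       tid_to_op_type_[std::this_thread::get_id()] = op_type;
       // ...start a profiling event for the pure Python code in the
       // "Custom Operator" domain here...
     }
   
     // Called right after the Python callback returns.
     void OnCustomEnd() {
       std::lock_guard<std::mutex> lock(mutex_);
       tid_to_op_type_.erase(std::this_thread::get_id());
       // ...stop that profiling event here...
     }
   
    private:
     std::mutex mutex_;
     std::unordered_map<std::thread::id, std::string> tid_to_op_type_;
   };
   ```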
   
   For 2), we already have operator profiling.
   
   For 3), we will also use CustomOpProfiler. We will maintain a mapping from CustomOperator worker thread id to the registered name of the custom operator being executed. Then, in PushAsync() in threaded_engine.cc, we will call GenerateDisplayName() in CustomOpProfiler to check whether we are on a CustomOperator worker thread. If so, the operator being pushed to the engine is a sub-operator of a custom operator, and we create a display name by concatenating the name of this operator to a prefix which is the name of the custom operator, e.g. “CustomOp1::_plus_scalar”. Furthermore, in class ProfileOperator in profiler.h, we check the display name of the operator: if the name contains “::”, we profile the event within the “Custom Operator” domain.
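   
   A standalone C++ sketch of this name generation is given below. It assumes a thread-id-to-name map like the one in the CustomOpProfiler sketch above (kept as file-level state here for brevity); the function signature and the helper for the “::” check are my assumptions, so see the PR for the real code.
   
   ```
   // Illustrative sketch; the actual GenerateDisplayName() may differ.
   #include <mutex>
   #include <string>
   #include <thread>
   #include <unordered_map>
   
   // Worker thread id -> registered name of the custom operator it is running
   // (maintained by OnCustomBegin()/OnCustomEnd() in the sketch above).
   static std::unordered_map<std::thread::id, std::string> tid_to_op_type;
   static std::mutex map_mutex;
   
   // Called from PushAsync() in threaded_engine.cc for every operator.
   std::string GenerateDisplayName(const char* op_name) {
     if (op_name == nullptr) return std::string();
     std::lock_guard<std::mutex> lock(map_mutex);
     auto it = tid_to_op_type.find(std::this_thread::get_id());
     if (it == tid_to_op_type.end()) {
       // Early return: not on a CustomOperator worker thread, so this is a
       // regular operator and its display name is left unchanged.
       return std::string(op_name);
     }
     // Sub-operator of a custom operator, e.g. "CustomOp1::_plus_scalar".
     return it->second + "::" + op_name;
   }
   
   // In ProfileOperator (profiler.h) the display name then decides the domain:
   // names containing "::" go into the "Custom Operator" domain.
   bool IsSubOperatorOfCustom(const std::string& display_name) {
     return display_name.find("::") != std::string::npos;
   }
   ```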
   
   I have created a pull request at: 
https://github.com/apache/incubator-mxnet/pull/15210.
   
   **More discussions**
   
   With this enhanced custom operator profiling, we also want to stop profiling “Dummy_Wait” entirely. This is done by adding a check in ProfileOperator in profiler.h.
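   
   A minimal sketch of such a check, assuming the profiler simply skips events whose operator name matches the dummy callback (the exact string and the surrounding code are assumptions):
   
   ```
   #include <string>
   
   // Illustrative sketch: skip profiling for the dummy dependency callback
   // pushed by CustomOperator's Push(). The exact name compared against may
   // differ in the actual PR.
   inline bool ShouldSkipProfiling(const std::string& op_name) {
     return op_name == "Dummy_Wait";
   }
   ```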
   
   Notice that because we are adding a call to GenerateDisplayName() in PushAsync(), we risk adding overhead to every operator call (we need to get the thread id, and the function takes a lock). In practice, however, because this function is short and has early-return checks, the overhead is small enough to forgive. On my machine (2017 13-inch MacBook Pro, i7), the overhead for regular operator calls is on average less than 1 microsecond (it appears as 0). For sub-operator calls, the overhead is always below 10 microseconds and averages below 5 microseconds. Compare this to the ~150 microseconds taken by a scalar addition on a 100×100 matrix. Note that this relatively larger overhead only affects sub-operators of custom operators.
   
   ### Usage
   
   ```
   import mxnet as mx
   from mxnet import nd
   from mxnet import profiler
   from mxnet import context
   import threading
   
   # Define custom operator CustomAddOne: element-wise add one
   class CustomAddOne(mx.operator.CustomOp):
       def forward(self, is_train, req, in_data, out_data, aux):        
           self.assign(out_data[0], req[0], in_data[0] + 1)
   
       def backward(self, req, out_grad, in_data, out_data, in_grad, aux):
           self.assign(in_grad[0], req[0], out_grad[0])
   
   
   @mx.operator.register("CustomAddOne")
   class CustomAddOneProp(mx.operator.CustomOpProp):
       def __init__(self):
           super(CustomAddOneProp, self).__init__(need_top_grad=True)
   
       def list_arguments(self):
           return ['in']
   
       def list_outputs(self):
           return ['output']
   
       def infer_shape(self, in_shape):
           # inputs, outputs, aux
           return [in_shape[0]], [in_shape[0]], []
   
       def create_operator(self, ctx, shapes, dtypes):
           return CustomAddOne()
   
   
   class CustomAddTwo(mx.operator.CustomOp):
       def forward(self, is_train, req, in_data, out_data, aux):
           # sub-operator: this NDArray call is pushed to the engine from the
           # custom operator's worker thread
           y = nd.sqrt(in_data[0])
           self.assign(out_data[0], req[0], in_data[0] + 2)
   
       def backward(self, req, out_grad, in_data, out_data, in_grad, aux):
           self.assign(in_grad[0], req[0], out_grad[0])
   
   
   @mx.operator.register("CustomAddTwo")
   class CustomAddTwoProp(mx.operator.CustomOpProp):
       def __init__(self):
           super(CustomAddTwoProp, self).__init__(need_top_grad=True)
   
       def list_arguments(self):
           return ['in']
   
       def list_outputs(self):
           return ['output']
   
       def infer_shape(self, in_shape):
           # inputs, outputs, aux
           return [in_shape[0]], [in_shape[0]], []
   
       def create_operator(self, ctx, shapes, dtypes):
           return CustomAddTwo()
   
   
   inp = mx.nd.zeros(shape=(500, 500))
   inp2 = mx.nd.zeros(shape=(500, 500))
   
   profiler.set_config(profile_all=True, aggregate_stats=True, continuous_dump=True)
   profiler.set_state('run')
   
   w = nd.Custom(inp, op_type="CustomAddOne")
   x = nd.sqrt(inp)
   y = nd.Custom(inp, op_type="CustomAddOne")
   z = nd.Custom(inp, op_type="CustomAddTwo")
   
   mx.nd.waitall()
   
   profiler.set_state('stop')
   profiler.dump()
   print(profiler.dumps(reset=True))
   ```
   
   
   Also, we can try using Gluon:
   ```
   
   import numpy as np
   import mxnet as mx
   from mxnet import profiler
   from mxnet import gluon, autograd
   import os
   
   class Dense(mx.operator.CustomOp):
       def __init__(self, bias):
           self._bias = bias
   
       def forward(self, is_train, req, in_data, out_data, aux):
           x = in_data[0].asnumpy()
           weight = in_data[1].asnumpy()
           y = x.dot(weight.T) + self._bias
           # sub-operator: this NDArray call is pushed to the engine from the
           # custom operator's worker thread
           mx.nd.zeros(shape=(500, 500))
           self.assign(out_data[0], req[0], mx.nd.array(y))
   
       def backward(self, req, out_grad, in_data, out_data, in_grad, aux):
           weight = in_data[1].asnumpy()
           dy = out_grad[0].asnumpy()
           # gradient w.r.t. the input: dL/dx = dL/dy . W, matching in_grad[0]'s shape
           dx = dy.dot(weight)
           self.assign(in_grad[0], req[0], mx.nd.array(dx))
   
   @mx.operator.register("dense") 
   class DenseProp(mx.operator.CustomOpProp):
       def __init__(self, bias):
           super(DenseProp, self).__init__(True)
           # we use constant bias here to illustrate how to pass arguments
           # to operators. All arguments are in string format so you need
           # to convert them back to the type you want.
           self._bias = float(bias)
   
       def list_arguments(self):
           return ['data', 'weight']
   
       def list_outputs(self):
           #  this can be omitted if you only have 1 output.
           return ['output']
   
       def infer_shape(self, in_shapes):
           data_shape = in_shapes[0]
           weight_shape = in_shapes[1]
           output_shape = (data_shape[0], weight_shape[0])
           # return 3 lists representing inputs shapes, outputs shapes, and aux data shapes.
           return (data_shape, weight_shape), (output_shape,), ()
   
       def create_operator(self, ctx, in_shapes, in_dtypes):
           #  create and return the CustomOp class.
           return Dense(self._bias)
   
   
   class DenseBlock(mx.gluon.Block):
       def __init__(self, in_channels, channels, bias, **kwargs):
           super(DenseBlock, self).__init__(**kwargs)
           self._bias = bias
           self.weight = self.params.get('weight', shape=(channels, in_channels))
   
       def forward(self, x):
           ctx = x.context
           return mx.nd.Custom(x, self.weight.data(ctx), bias=self._bias, op_type='dense')
   
   
   
   dense = DenseBlock(3, 5, 0.1)
   dense.initialize()
   x = mx.nd.uniform(shape=(4, 3))
   
   
   profiler.set_config(profile_all=True, aggregate_stats=True, continuous_dump=True)
   profiler.set_state('run')
   
   
   y = dense(x)
   
   
   mx.nd.waitall()
   profiler.set_state('stop')
   profiler.dump()
   print(profiler.dumps(reset=True))
   
   ```
