[GitHub] [incubator-mxnet] tqchen edited a comment on issue #17097: [mxnet 2.0][item 10.1] MXNet Imperative Op Invocation Overhead

GitBox Fri, 20 Dec 2019 10:37:01 -0800

tqchen edited a comment on issue #17097: [mxnet 2.0][item 10.1] MXNet 
Imperative Op Invocation Overhead
URL: 
https://github.com/apache/incubator-mxnet/issues/17097#issuecomment-568035665
 
 
   To follow up on this thread about bringing native support to the TVM FFI, I 
will first discuss ways to handle tuples, and then discuss some of the pros and 
cons.
   
   First of all, we all know that at some point we have to translate the 
Python data structures into C++. The main question is where that translation 
happens. In the pybind case, it happens on the C++ side, by passing the 
PyObject directly to C++. In the Cython case, it happens at the C API level. In 
the TVM FFI case, it can happen in a Python wrapper.
   
   The following code gives two examples of such translation (myempty0 and 
myempty1). Support for ```myempty1``` requires one additional commit in 
[this branch](https://github.com/tqchen/tvm/tree/int_tuple).
   
   In the first approach (myempty0), we unpack the tuple directly into 
positional arguments, encoding the data structure as flattened arguments. In 
the second approach (myempty1), we first create an IntTuple object that the C++ 
side can recognize and then pass it to the C++ side. Note that at the moment 
all of these operations are done in Python. If there is a concern about the 
wrapping overhead, we can certainly move some of them into Cython.
   
   We could introduce a third approach (myempty2), which passes a 
```PyObject*``` directly to the C++ side, so that the developer can use 
additional packages (e.g. pybind or related) to process the tuple. This third 
approach would achieve the same performance as pybind. However, there are some 
trade-offs; see the Discussion section.
   
   ```python
   import timeit
   import tvm
   
   # No-op packed function exposed through the TVM FFI; calling it measures
   # pure invocation overhead.
   nop = tvm._api_internal._nop
   
   num_repeat = 1000
   
   # Baseline: pass the tuple straight to the FFI, with no Python wrapper.
   setup = """
   import tvm
   x = tvm.nd.array([0])
   y = tvm.nd.array([1])
   nop = tvm._api_internal._nop
   """
   timer = timeit.Timer(setup=setup,
                        stmt='nop((1, 2, 1))')
   timer.timeit(1)  # warm-up run
   print("tvm.nowrap:", timer.timeit(num_repeat) / num_repeat)
   
   # Reference point: numpy.empty with the same shape.
   setup = """
   import numpy as np
   """
   
   timer = timeit.Timer(setup=setup,
                        stmt='np.empty((1, 2, 1))')
   timer.timeit(1)  # warm-up run
   print("numpy.empty:", timer.timeit(num_repeat) / num_repeat)
   
   def myempty0(shape):
       # Approach 1: flatten the tuple into positional arguments.
       return nop(*shape)
   
   def myempty1(shape):
       # Approach 2: build an IntTuple object that the C++ side recognizes.
       return nop(tvm.container.IntTuple(*shape))
   
   setup = """
   import numpy as np
   import tvm
   from __main__ import myempty0, myempty1
   """
   
   timer = timeit.Timer(setup=setup,
                        stmt='myempty0((1, 2, 1))')
   timer.timeit(1)  # warm-up run
   print("tvm.myempty0:", timer.timeit(num_repeat) / num_repeat)
   
   timer = timeit.Timer(setup=setup,
                        stmt='myempty1((1, 2, 1))')
   timer.timeit(1)  # warm-up run
   print("tvm.myempty1:", timer.timeit(num_repeat) / num_repeat)
   ```
   
   Here are the results on my computer (seconds per call):
   ```
   $ TVM_FFI=ctypes python test.py
   tvm.nowrap: 5.209704500000001e-05
   numpy.empty: 2.8674499999997715e-07
   tvm.myempty0: 7.312735000000015e-06
   tvm.myempty1: 1.3553711000000024e-05
   $ TVM_FFI=cython python test.py
   tvm.nowrap: 1.3689438999999998e-05
   numpy.empty: 2.86522999999983e-07
   tvm.myempty0: 1.7771199999999653e-07
   tvm.myempty1: 9.764679999999804e-07
   ```
   
   As we can see, ```myempty0``` is quite fast: with the Cython FFI it takes 
about 0.18 microseconds per call, compared with about 0.29 microseconds for 
```np.empty```. ```myempty1``` is slower (about 0.98 microseconds) because it 
has to create the object, but it is still within an order of magnitude that we 
can use. Note that for async execution and gradient taping, we will need to 
create an object to keep track of the arguments anyway, so myempty1 may not be 
a bad approach, as the data structure is already created at the FFI level.
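
To put these numbers in perspective, the snippet below computes a few ratios from the timings reported above. The values are copied from the benchmark output; the dictionary layout and the variable names are only illustrative.

```python
# Per-call timings in seconds, copied from the benchmark output above.
timings = {
    "ctypes": {
        "tvm.nowrap": 5.209704500000001e-05,
        "numpy.empty": 2.8674499999997715e-07,
        "tvm.myempty0": 7.312735000000015e-06,
        "tvm.myempty1": 1.3553711000000024e-05,
    },
    "cython": {
        "tvm.nowrap": 1.3689438999999998e-05,
        "numpy.empty": 2.86522999999983e-07,
        "tvm.myempty0": 1.7771199999999653e-07,
        "tvm.myempty1": 9.764679999999804e-07,
    },
}

cy = timings["cython"]
ct = timings["ctypes"]

# Cost of creating the IntTuple object relative to flattened arguments.
object_overhead = cy["tvm.myempty1"] / cy["tvm.myempty0"]

# How much the Cython FFI helps the flattened-argument path.
cython_speedup = ct["tvm.myempty0"] / cy["tvm.myempty0"]

# Flattened-argument call compared with numpy.empty under Cython.
vs_numpy = cy["tvm.myempty0"] / cy["numpy.empty"]

print(f"myempty1 / myempty0 (cython):    {object_overhead:.2f}x")
print(f"ctypes / cython (myempty0):      {cython_speedup:.2f}x")
print(f"myempty0 / numpy.empty (cython): {vs_numpy:.2f}x")
```

Keep in mind that these ratios come from a single run of 1000 repetitions on one machine, so they indicate rough magnitudes rather than precise figures.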
   
   ## Discussion
   
   As explained at the beginning, the real question is where the wrapping 
should happen.
   
   In the case of TVM, the wrapping usually happens at the frontend language 
(Python) level, because we need a Python-side wrapper anyway for cleaner code, 
type checking and docs. The translation converts the Python arguments into 
arguments that the runtime can recognize.
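
As a minimal sketch of what such a Python-side wrapper could look like, the example below validates and normalizes the shape before handing flat integers to the backend. `_backend_empty` is a hypothetical stand-in for a packed function (it is not a TVM or MXNet API), and the exact checks are assumptions chosen for illustration.

```python
import operator


def _backend_empty(*dims):
    """Hypothetical stand-in for a packed function taking flat integers."""
    return ("empty", dims)


def empty(shape):
    """Create an uninitialized array.

    Parameters
    ----------
    shape : int or sequence of int
        Extent of each dimension; every entry must be non-negative.
    """
    # Accept a bare integer as a one-dimensional shape.
    if isinstance(shape, int):
        shape = (shape,)
    # operator.index rejects floats and other non-integer values.
    dims = tuple(operator.index(d) for d in shape)
    if any(d < 0 for d in dims):
        raise ValueError(f"negative dimension in shape {dims}")
    # Translation step: the tuple becomes flat positional arguments.
    return _backend_empty(*dims)


print(empty((1, 2, 1)))
```

The docstring, the type check and the argument flattening all live in Python here, so the backend only ever sees plain integers.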
   
   In the case of pybind, the translation happens at the C++ level, by calling 
into the Python C API (the myempty2 approach is similar to this one).
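
To illustrate what "translation on the native side" means, the sketch below hands a Python tuple to CPython C API functions as a `PyObject*` and lets them unpack it. It uses `ctypes.pythonapi` purely as a stand-in for C++ code, assumes CPython, and the helper name `unpack_shape_natively` is hypothetical. This is neither how pybind is implemented nor the proposed myempty2 code.

```python
import ctypes

# CPython C API functions reached through ctypes. Each one receives the
# Python object itself (a PyObject*) rather than pre-translated values.
_tuple_size = ctypes.pythonapi.PyTuple_Size
_tuple_size.argtypes = [ctypes.py_object]
_tuple_size.restype = ctypes.c_ssize_t

# PySequence_GetItem returns a new reference, which matches how ctypes
# treats a py_object return value.
_seq_get_item = ctypes.pythonapi.PySequence_GetItem
_seq_get_item.argtypes = [ctypes.py_object, ctypes.c_ssize_t]
_seq_get_item.restype = ctypes.py_object

_long_as_longlong = ctypes.pythonapi.PyLong_AsLongLong
_long_as_longlong.argtypes = [ctypes.py_object]
_long_as_longlong.restype = ctypes.c_longlong


def unpack_shape_natively(shape):
    """Unpack a tuple of ints using only Python C API calls."""
    n = _tuple_size(shape)
    return [_long_as_longlong(_seq_get_item(shape, i)) for i in range(n)]


print(unpack_shape_natively((1, 2, 1)))
```

The unpacking logic here depends entirely on the Python C API, which is the coupling discussed in the next paragraph.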
   
   The advantage of exposing PyObject and related operations at the C++ level 
is certainly the deferred marshaling of data structures. On the other hand, 
such an approach ties the FFI directly to Python, which means other language 
frontends can no longer benefit from the new FFI. Similarly, if we want to 
package some of the functions into a minimal runtime that is independent of 
Python, we can no longer do that. This is why, while in theory we could bring 
PyObject (or a related proxy) into the TVM runtime, we have not done so yet.
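
For contrast, here is a rough sketch of a language-neutral calling convention: the frontend encodes its arguments as type tags plus raw values, so the runtime never needs to know about Python objects. The `TypeCode` values and the byte layout below are hypothetical and chosen only for illustration; they are not the encoding used by TVM or any other runtime.

```python
import struct
from enum import IntEnum


class TypeCode(IntEnum):
    """Hypothetical type tags; not the codes of any real runtime."""
    INT = 0
    FLOAT = 1


def encode_args(*args):
    """Encode positional arguments as (type codes, packed values)."""
    codes = bytearray()
    payload = bytearray()
    for arg in args:
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(arg, bool) or not isinstance(arg, (int, float)):
            raise TypeError(f"unsupported argument type: {type(arg).__name__}")
        if isinstance(arg, int):
            codes.append(TypeCode.INT)
            payload += struct.pack("<q", arg)  # 64-bit little-endian integer
        else:
            codes.append(TypeCode.FLOAT)
            payload += struct.pack("<d", arg)  # 64-bit little-endian float
    return bytes(codes), bytes(payload)


codes, payload = encode_args(1, 2, 1)
print(list(codes), struct.unpack("<3q", payload))
```

Any frontend that can produce the same tags and bytes could call the same function, which is the property that is lost once `PyObject*` appears in the signature.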
   
   Of course this is an interesting trade-off, and everyone is welcome to 
share their thoughts.

