tqchen edited a comment on issue #17097: [mxnet 2.0][item 10.1] MXNet Imperative Op Invocation Overhead
URL: https://github.com/apache/incubator-mxnet/issues/17097#issuecomment-568035665

To follow up on this thread about bringing native support to the TVM FFI, I will first discuss ways to handle tuples, and then discuss some of the pros and cons.

First of all, we all know that at some point we have to translate the Python data structure into something C++ understands. The main question is where that translation happens. In the pybind case, the translation happens on the C++ side, by passing the PyObject directly to C++. In the Cython case, the translation happens at the C API level. In the TVM FFI case, the translation can happen in a Python wrapper.

The following code gives two examples of such translation (myempty0 and myempty1); support for ```myempty1``` requires one additional commit in [this branch](https://github.com/tqchen/tvm/tree/int_tuple).

In the first approach (myempty0), we unpack the tuple directly into positional arguments, encoding the data structure as flattened arguments. In the second approach (myempty1), we first create an IntTuple object that the C++ side can recognize and then pass it to the C++ side. Note that at the moment all of these operations are done in Python; if the wrapping overhead is a concern, we can certainly move some of them into Cython.

We could introduce a third approach (myempty2), which passes a ```PyObject*``` directly to the C++ side; the developer can then use additional packages (e.g. pybind or similar) to process the tuple. This third approach would achieve the same performance as pybind. However, there are some trade-offs; see the discussion section.
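As a TVM-free illustration of the two Python-side translations described above, here is a minimal sketch. It assumes a hypothetical stand-in, `packed_call`, that only accepts integers or a single opaque handle, roughly the kind of values a language-neutral FFI boundary might carry. `packed_call`, `IntTupleHandle`, `empty_flat`, and `empty_object` are illustrative names and not TVM APIs; the benchmark that follows uses the real TVM calls.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class IntTupleHandle:
    """Hypothetical stand-in for a runtime object such as an IntTuple."""
    values: Tuple[int, ...]


def packed_call(*args):
    """Hypothetical FFI boundary (an assumption, not a TVM function).

    Accepts either a flat list of ints or one IntTupleHandle, and
    returns the shape it decoded. A raw Python tuple is rejected,
    because the boundary does not understand Python containers.
    """
    if all(isinstance(a, int) for a in args):
        return tuple(args)
    if len(args) == 1 and isinstance(args[0], IntTupleHandle):
        return args[0].values
    raise TypeError("argument cannot cross the FFI boundary")


def empty_flat(shape):
    """First approach (like myempty0): unpack into positional arguments."""
    return packed_call(*shape)


def empty_object(shape):
    """Second approach (like myempty1): build a runtime object, then pass it."""
    return packed_call(IntTupleHandle(tuple(shape)))
```

Both wrappers do the translation in Python, so the boundary itself never sees a Python tuple; the difference is only whether the shape travels as flattened arguments or as one pre-built object.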
```python
import timeit
import tvm

nop = tvm._api_internal._nop

# Baseline: pass the tuple straight to the FFI without any Python wrapper.
setup = """
import tvm
x = tvm.nd.array([0])
y = tvm.nd.array([1])
nop = tvm._api_internal._nop
"""
timer = timeit.Timer(setup=setup, stmt='nop((1, 2, 1))')
timer.timeit(1)  # warm-up call; result discarded
num_repeat = 1000
print("tvm.nowrap:", timer.timeit(num_repeat) / num_repeat)

# Reference point: numpy.empty with the same shape.
setup = """
import numpy as np
"""
timer = timeit.Timer(setup=setup, stmt='np.empty((1, 2, 1))')
timer.timeit(1)
print("numpy.empty:", timer.timeit(num_repeat) / num_repeat)


def myempty0(shape):
    # Approach 1: unpack the tuple into flattened positional arguments.
    return nop(*shape)


def myempty1(shape):
    # Approach 2: build an IntTuple object first, then pass it to C++.
    return nop(tvm.container.IntTuple(*shape))


setup = """
import numpy as np
import tvm
from __main__ import myempty0, myempty1
"""
timer = timeit.Timer(setup=setup, stmt='myempty0((1, 2, 1))')
timer.timeit(1)
print("tvm.myempty0:", timer.timeit(num_repeat) / num_repeat)

timer = timeit.Timer(setup=setup, stmt='myempty1((1, 2, 1))')
timer.timeit(1)
print("tvm.myempty1:", timer.timeit(num_repeat) / num_repeat)
```

Here are the results on my computer:

```
$ TVM_FFI=ctypes python test.py
tvm.nowrap: 5.209704500000001e-05
numpy.empty: 2.8674499999997715e-07
tvm.myempty0: 7.312735000000015e-06
tvm.myempty1: 1.3553711000000024e-05

$ TVM_FFI=cython python test.py
tvm.nowrap: 1.3689438999999998e-05
numpy.empty: 2.86522999999983e-07
tvm.myempty0: 1.7771199999999653e-07
tvm.myempty1: 9.764679999999804e-07
```

As we can see, ```myempty0``` is quite fast. myempty1 is a bit slower because of the object creation (but still within an order of magnitude that we can use). Note that for async execution and gradient taping, we will need to create an object to keep track of the arguments anyway, so myempty1 may not be a bad approach, since the data structure is already created at the FFI level.

## Discussion

As explained at the beginning, the real question is where the wrapping should happen.
In the case of TVM, the wrapping usually happens at the native language (Python) level, because we know a Python-side wrapper is needed anyway for cleaner code, type checking, and docs. The translation converts the Python arguments into arguments that the runtime can recognize.

In the case of pybind, the translation happens at the C++ level, by calling into the Python C API (the myempty2 approach is similar to this one). The advantage of exposing PyObject and related operations at the C++ level is the deferred marshaling of data structures.

On the other hand, such an approach ties the FFI directly to Python. Other language frontends can no longer benefit from the new FFI. Similarly, if we want to package some of the functions into a minimal runtime that is independent of Python, we can no longer do so. This is why, although in theory we could bring PyObject (or a related proxy) into the TVM runtime, we have not done so thus far.

Of course this is an interesting trade-off, and everyone is welcome to share their thoughts.
