tqchen edited a comment on issue #17097: [RFC][mxnet 2.0][item 10.1] MXNet 
Imperative Op Invocation Overhead
URL: 
https://github.com/apache/incubator-mxnet/issues/17097#issuecomment-568325041
 
 
   After some thought along this direction, I found a better and fun answer to the above question: how to support tuple/ellipsis/slice in the tvm ffi effectively.
   
   I hacked up a POC in https://github.com/tqchen/tvm/tree/poc-pyffi (latest commit) that supports the following benchmark script (disclaimer: it is only a POC, so it is not intended for use or fully optimized, but it demonstrates all the technical flows necessary to make a fully functioning FFI).
   
   ```python
   import timeit
   import tvm
   
   setup = """
   import tvm
   nop = tvm._api_internal._nop
   """
   timer = timeit.Timer(setup=setup,
                        stmt='nop((None, ..., slice(0, 100, 2)))')
   timer.timeit(1)  # warm-up run
   num_repeat = 1000
   print("tvm.tuple_slice_ellipsis_combo:", timer.timeit(num_repeat) / num_repeat)
   
   setup = """
   import numpy as np
   """
   timer = timeit.Timer(setup=setup,
                        stmt='np.empty((1, 2, 1))')
   timer.timeit(1)
   print("numpy.empty:", timer.timeit(num_repeat) / num_repeat)
   
   setup = """
   import tvm
   nop = tvm._api_internal._nop
   """
   timer = timeit.Timer(setup=setup,
                        stmt='nop("mystr")')
   timer.timeit(1)
   print("tvm.str_arg:", timer.timeit(num_repeat) / num_repeat)
   ```
   
   On my laptop (13-inch MacBook), the results are as follows:
   ```
   $ TVM_FFI=cython python benchmark_ffi.py
   tvm.tuple_slice_ellipsis_combo: 4.615739999999924e-07
   numpy.empty: 2.7016599999998834e-07
   tvm.str_arg: 2.3390799999997714e-07
   ```
   
   ##  What is Implemented in the POC 
   
   In the POC, we introduce specific objects for Ellipsis, Slice, and Tuple (the latter already supported as part of the ADT). During a PackedFunc call, a python tuple/ellipsis/slice is converted into the corresponding object supported by the backend. We implemented a cython version of this conversion (the previous recursive conversion was in python) to back it up.
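
   The recursive conversion can be sketched in plain python as follows. This is only an illustrative sketch: `EllipsisObj`, `SliceObj`, and `TupleObj` are hypothetical placeholder classes, not the actual TVM object types.

```python
# Hedged sketch of the deep conversion described above; the class names
# are placeholders, not the real TVM backend objects.
class EllipsisObj:
    """Backend stand-in for Python's Ellipsis."""

class SliceObj:
    """Backend stand-in for a Python slice."""
    def __init__(self, start, stop, step):
        self.start, self.stop, self.step = start, stop, step

class TupleObj:
    """Backend stand-in for a Python tuple (ADT-like container)."""
    def __init__(self, fields):
        self.fields = fields

def convert_arg(value):
    """Recursively deep-copy a Python argument into backend objects."""
    if value is Ellipsis:
        return EllipsisObj()
    if isinstance(value, slice):
        return SliceObj(value.start, value.stop, value.step)
    if isinstance(value, tuple):
        return TupleObj([convert_arg(x) for x in value])
    return value  # plain values (None, int, str, ...) pass through
```

   The POC performs this same kind of translation in cython rather than python, which is where the speedup over the earlier pure-python recursion comes from.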
   
   The reason we are able to create objects on the cython side is that all TVM objects were recently made POD-C compatible, so an object can be created on the cython side without crossing the DLL boundary and then be passed to the c++ backend.
   
   We can see from the benchmark that the cost of such a deep copy is at a reasonable level. We also only used the default memory allocator, so there could be room for further improvement.
   
   ##  Technical Choices and Tradeoffs
   
   Please also see the tradeoff discussion in the last post. As we can see, the main differences are where the conversion happens and whether we do a lazy or a deep copy:
   
   - In the case of pybind: the conversion happens on the c++ side, and data structures are created lazily.
   - In the case of the POC: the conversion happens in cython, and data structures are deeply translated into another in-memory format.
   
   The laziness certainly avoids a copy in cases where we do not need to book-keep the created argument. On the other hand, supporting a common data structure on the c++ side means the binding can potentially be reused by other language frontends.
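
   To make the lazy-versus-deep-copy tradeoff concrete, here is a minimal sketch in plain python. Neither class is real pybind11 or TVM code; both are hypothetical illustrations of the two strategies.

```python
# Hypothetical illustration of the two conversion strategies; not actual
# pybind11 or TVM code.

class LazyView:
    """pybind-style: keep a handle to the original Python object and
    convert an element only when the callee actually asks for it."""
    def __init__(self, py_obj):
        self._src = py_obj  # no copy at call time

    def get(self, i):
        # Per-element conversion would happen here, on demand.
        return self._src[i]

def eager_convert(value):
    """POC-style: deep-copy the whole argument into a backend-friendly
    format (nested lists here) before the call proceeds."""
    if isinstance(value, tuple):
        return [eager_convert(x) for x in value]
    return value
```

   The lazy view is cheaper when the callee never touches most of the argument, while the eager copy yields a frontend-independent structure that the c++ side (and other language bindings) can keep after the call returns.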
   
