After thinking further along this direction, I found a better and more fun answer to the 
above question: support tuple/ellipsis/slice in the TVM FFI effectively.

I quickly hacked up a POC in https://github.com/tqchen/tvm/tree/pyffi that 
supports the following benchmark script. (Disclaimer: it is only a POC, so it is 
not intended for use and is not fully optimized, but it demonstrates all the 
technical flows necessary to make a fully functioning FFI.)

```python
import timeit
import tvm
nop = tvm._api_internal._nop

setup = """
import tvm
nop = tvm._api_internal._nop
"""
timer = timeit.Timer(setup=setup,
                     stmt='nop((None, ..., slice(0, 100, 2)))')
timer.timeit(1)
num_repeat = 1000
print("tvm.tuple_slice_ellipsis_combo:", timer.timeit(num_repeat) / num_repeat)


setup = """
import numpy as np
"""

timer = timeit.Timer(setup=setup,
                     stmt='np.empty((1, 2, 1))')
timer.timeit(1)
print("numpy.emmpty:", timer.timeit(num_repeat) / num_repeat)

setup = """
import tvm
nop = tvm._api_internal._nop
"""
timer = timeit.Timer(setup=setup,
                     stmt='nop("mystr")')
timer.timeit(1)
num_repeat = 1000
print("tvm.str_arg:", timer.timeit(num_repeat) / num_repeat)
```

On my laptop (a 13-inch MacBook), the results are as follows:
```
$ TVM_FFI=cython python benchmark_ffi.py
tvm.tuple_slice_ellipsis_combo: 4.615739999999924e-07
numpy.emmpty: 2.7016599999998834e-07
tvm.str_arg: 2.3390799999997714e-07
```
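
To put the numbers above in perspective, here is a small sketch that converts the three timings to nanoseconds and compares them. The variable names are my own, and the values are copied from the run above. The difference between the tuple call and the string call is only a rough estimate of the extra conversion cost, since the string argument has its own conversion overhead.

```python
# Per-call timings in seconds, copied from the benchmark output above.
combo_s = 4.615739999999924e-07        # tvm.tuple_slice_ellipsis_combo
numpy_empty_s = 2.7016599999998834e-07  # np.empty((1, 2, 1))
str_arg_s = 2.3390799999997714e-07      # tvm.str_arg


def to_ns(seconds):
    """Convert seconds to nanoseconds."""
    return seconds * 1e9


# Extra cost of passing (None, ..., slice(0, 100, 2)) over a plain string.
extra_ns = to_ns(combo_s) - to_ns(str_arg_s)

# How the tuple call compares with a small numpy allocation.
ratio_vs_numpy = combo_s / numpy_empty_s

print(f"extra cost over str arg: {extra_ns:.1f} ns")
print(f"ratio vs np.empty:       {ratio_vs_numpy:.2f}x")
```

So the tuple/ellipsis/slice argument adds roughly 230 ns over a string argument, and the whole call is about 1.7x the cost of a small `np.empty` call.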

## What is Implemented in the POC

In the POC, we introduced dedicated objects for Ellipsis, Slice, and 
Tuple (the latter is already supported via ADT). During a PackedFunc call, a Python 
tuple/ellipsis/slice is converted into the corresponding object supported by the 
backend. We implemented the conversion in Cython (the previous recursive conversion 
was in Python) to back it up.
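
The recursive conversion described above can be sketched in plain Python. This is only an illustration of the idea, not the POC code: `EllipsisObj`, `SliceObj`, `TupleObj`, and `convert_arg` are hypothetical names, and the real POC does this in Cython against TVM's own object types.

```python
from dataclasses import dataclass
from typing import Any, Optional, Tuple


@dataclass(frozen=True)
class EllipsisObj:
    """Hypothetical backend stand-in for Python's Ellipsis."""


@dataclass(frozen=True)
class SliceObj:
    """Hypothetical backend stand-in for a Python slice."""
    start: Optional[int]
    stop: Optional[int]
    step: Optional[int]


@dataclass(frozen=True)
class TupleObj:
    """Hypothetical backend stand-in for a Python tuple."""
    fields: Tuple[Any, ...]


def convert_arg(value):
    """Recursively translate a Python argument into backend-style objects."""
    if value is Ellipsis:
        return EllipsisObj()
    if isinstance(value, slice):
        return SliceObj(value.start, value.stop, value.step)
    if isinstance(value, tuple):
        # Deep translation: every element is converted eagerly.
        return TupleObj(tuple(convert_arg(v) for v in value))
    # None, ints, strings, etc. pass through unchanged in this sketch.
    return value


# The same argument shape used in the benchmark script.
converted = convert_arg((None, ..., slice(0, 100, 2)))
print(converted)
```

Nested tuples are handled by the same recursion, so `convert_arg(((1, 2), ...))` yields a `TupleObj` whose first field is itself a `TupleObj`.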

We are able to create objects on the Cython side because all TVM objects 
were recently converted to be POD-C compatible, so an object can be created 
on the Cython side without crossing the DLL boundary and then passed to 
the C++ backend.

We can see from the benchmark that the cost of this deep copy is at a 
reasonable level. We also only used the default memory allocator, so there 
could be room for further improvement.

## Discussions

Please also see the tradeoff discussion in the last post. As we can see, the main 
differences here are where to do the conversion and whether to do a lazy or deep 
copy:

- In the case of pybind: conversion happens on the C++ side, and data 
structures are lazily created.
- In the case of the POC: conversion happens in Cython, and data structures are 
deeply translated into another in-memory format.

The laziness certainly avoids a copy in cases where we do not necessarily need 
to book-keep the created argument. On the other hand, supporting a common data 
structure on the C++ side means the binding can potentially be reused by other 
language frontends.
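
The tradeoff can be illustrated with a toy sketch in Python. Everything here is hypothetical (`describe`, `eager_translate`, and `LazyTupleView` are names invented for illustration, and the tagged tuples stand in for a frontend-independent format); it is not how pybind or the POC is implemented, only a model of the two strategies.

```python
def describe(value):
    """Toy conversion into a frontend-independent tagged tuple."""
    if value is Ellipsis:
        return ("ellipsis",)
    if isinstance(value, slice):
        return ("slice", value.start, value.stop, value.step)
    if isinstance(value, tuple):
        return ("tuple",) + tuple(describe(v) for v in value)
    return ("value", value)


def eager_translate(source):
    """Deep translation up front, as in the POC: the result no longer
    depends on the original Python object."""
    return describe(source)


class LazyTupleView:
    """Lazy strategy: keep a reference to the Python tuple and convert
    an element only when it is actually read."""

    def __init__(self, source):
        self._source = source
        self.conversions = 0  # counts how much work was really done

    def __getitem__(self, index):
        self.conversions += 1
        return describe(self._source[index])


arg = (None, ..., slice(0, 100, 2))

eager = eager_translate(arg)   # all three elements converted now
lazy = LazyTupleView(arg)      # nothing converted yet
third = lazy[2]                # only the slice is converted

print(eager)
print(third, "after", lazy.conversions, "conversion(s)")
```

The lazy view does less work when the callee reads only part of the argument, but it stays tied to the Python object. The eager result is self-contained, which is what would let a backend data structure be shared across language frontends.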









