After some more thought along this direction, I found a better and more fun answer to the above question: support tuple/ellipsis/slice in the TVM FFI effectively.
I quickly hacked up a POC in https://github.com/tqchen/tvm/tree/pyffi that supports the following benchmark script (disclaimer: it is only a POC, so it is not intended for use and is not fully optimized, but it demonstrates all the technical flows necessary to make a fully functioning FFI).

```python
import timeit
import tvm

nop = tvm._api_internal._nop

# Case 1: pass a tuple containing None, Ellipsis and a slice through the TVM FFI.
setup = """
import tvm
nop = tvm._api_internal._nop
"""
timer = timeit.Timer(setup=setup, stmt='nop((None,..., slice(0, 100, 2)))')
timer.timeit(1)  # warm-up run
num_repeat = 1000
print("tvm.tuple_slice_ellipsis_combo:", timer.timeit(num_repeat) / num_repeat)

# Case 2: numpy call taking a tuple argument, for reference.
setup = """
import numpy as np
"""
timer = timeit.Timer(setup=setup, stmt='np.empty((1,2,1))')
timer.timeit(1)  # warm-up run
print("numpy.emmpty:", timer.timeit(num_repeat) / num_repeat)

# Case 3: pass a plain string argument through the TVM FFI.
setup = """
import tvm
nop = tvm._api_internal._nop
"""
timer = timeit.Timer(setup=setup, stmt='nop("mystr")')
timer.timeit(1)  # warm-up run
num_repeat = 1000
print("tvm.str_arg:", timer.timeit(num_repeat) / num_repeat)
```

On my laptop (13-inch MacBook), the results are as follows (average seconds per call):

```
$ TVM_FFI=cython python benchmark_ffi.py
tvm.tuple_slice_ellipsis_combo: 4.615739999999924e-07
numpy.emmpty: 2.7016599999998834e-07
tvm.str_arg: 2.3390799999997714e-07
```

## What is Implemented in the POC

In the POC, we introduced specific objects for Ellipsis, Slice and Tuple (the latter is already supported via ADT). During a PackedFunc call, a Python tuple/ellipsis/slice is converted into the corresponding object supported by the backend. We implemented the conversion in Cython to back it up (the previous recursive conversion was in Python).

We are able to create an Object on the Cython side because all TVM objects have recently been made POD-C compatible, so an object can be created on the Cython side without crossing the DLL boundary and then passed to the C++ backend.

We can see from the benchmark that the cost of such a deep copy is at a reasonable level. We also only used the default memory allocator, so there could be room for further improvement.
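To make the conversion step above concrete, here is a minimal pure-Python sketch of an eager, recursive translation of tuple/ellipsis/slice arguments. It is not the POC's Cython code: `EllipsisObj`, `SliceObj`, `TupleObj` and `convert_arg` are hypothetical stand-ins for the backend objects and the conversion routine, used only to illustrate the deep-translation idea.

```python
from dataclasses import dataclass
from typing import Any, Optional, Tuple


@dataclass(frozen=True)
class EllipsisObj:
    """Hypothetical stand-in for a backend Ellipsis object."""


@dataclass(frozen=True)
class SliceObj:
    """Hypothetical stand-in for a backend Slice object."""
    start: Optional[int]
    stop: Optional[int]
    step: Optional[int]


@dataclass(frozen=True)
class TupleObj:
    """Hypothetical stand-in for a backend tuple (ADT-like) object."""
    fields: Tuple[Any, ...]


def convert_arg(arg):
    """Eagerly (deeply) translate a Python argument into backend-style objects."""
    if arg is Ellipsis:
        return EllipsisObj()
    if isinstance(arg, slice):
        return SliceObj(arg.start, arg.stop, arg.step)
    if isinstance(arg, tuple):
        # Recurse so that nested tuples are translated as well.
        return TupleObj(tuple(convert_arg(x) for x in arg))
    # Everything else (None, int, str, ...) passes through unchanged in this sketch.
    return arg


# Same argument shape as in the benchmark statement.
converted = convert_arg((None, ..., slice(0, 100, 2)))
print(converted)
```

Each call builds a fresh object tree, which is the deep copy whose cost the benchmark measures. In the POC this work is done in Cython rather than in interpreted Python.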
## Discussions

Please also see the tradeoff discussion in the last post. As we can see, the main difference here is where the conversion happens, and whether we convert lazily or make a deep copy:

- In the case of pybind: the conversion happens on the C++ side, and data structures are created lazily.
- In the case of the POC: the conversion happens in Cython, and data structures are deeply translated into another in-memory format.

The laziness certainly avoids a copy in cases where we do not need to book-keep the created argument. On the other hand, supporting a common data structure on the C++ side means the binding can potentially be reused by other language frontends.