@ptrendx Yes, there is an ongoing effort to profile the engine code flow with VTune. We hope the exercise will pinpoint the hotspots that contribute most to the latency. Within the pure C++ part, the time is also split roughly 50% vs. 50% between setup code (shape/type inference, memory allocation, dependency setup) and op scheduling.
For the "fast path" data structures, here is a summary of the items so far (including the ones suggested by @sxjscience):

- `tuple` and `list`, since they are interchangeable in NumPy semantics for representing shapes and axes.
- `str`, because einsum takes this parameter and the op can be used intensively in transformer models.
- `py_slice`, `Ellipsis`, `None` for basic indexing. We could go one step further by moving the whole indexing dispatch logic to the backend.
- NumPy scalars.
- `mx.context.Context`. One call of `mx.cpu()` can take as long as 600ns using ctypes. One idea is to do this the pybind way, by creating a Python binding for the backend `Context` class.
- `np.dtype`. Similar to `Context`.
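
To make the list above more concrete, here is a minimal Python sketch of what a fast-path argument converter could look like. It is only an illustration, not MXNet's implementation: the tag constants, the `_FAST_PATH` table, and the `convert` function are all hypothetical names, and the real conversion would happen in the backend rather than in Python. NumPy scalars, `Context`, and `np.dtype` are left out so that the sketch runs with the standard library only.

```python
# Hypothetical type tags; a real backend would define its own codes.
INT, FLOAT, STR, SEQ, SLICE, ELLIPSIS, NONE = range(7)


def _convert_seq(value):
    # tuple and list are treated identically, as in NumPy shapes and axes.
    return (SEQ, tuple(convert(v) for v in value))


def _convert_slice(value):
    # Keep start/stop/step so the backend could handle basic indexing.
    return (SLICE, (value.start, value.stop, value.step))


# Exact-type lookup table: one dict access instead of an isinstance chain.
_FAST_PATH = {
    int: lambda v: (INT, v),
    float: lambda v: (FLOAT, v),
    str: lambda v: (STR, v),  # e.g. einsum subscripts
    tuple: _convert_seq,
    list: _convert_seq,
    slice: _convert_slice,
    type(Ellipsis): lambda v: (ELLIPSIS, None),
    type(None): lambda v: (NONE, None),
}


def convert(value):
    """Convert a Python argument to a (tag, payload) pair via the fast path."""
    handler = _FAST_PATH.get(type(value))
    if handler is None:
        # Anything not in the table would fall back to a slower generic path.
        raise TypeError(f"no fast path for {type(value).__name__}")
    return handler(value)


if __name__ == "__main__":
    print(convert((2, 3)))
    print(convert("ij,jk->ik"))
    print(convert((slice(1, None, 2), Ellipsis, None)))
```

The lookup is keyed on the exact type, so subclasses (including `bool`) miss the table and raise `TypeError` here; a real design would need to decide whether those cases belong on the fast path or on a fallback.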