@ptrendx Yes, there is an ongoing effort to profile the engine code flow with VTune. We 
hope the exercise will pinpoint the hotspots that contribute most of the 
latency. Within the pure C++ part, the further split between setup code (shape/type 
inference, memory allocation, dependency setup) and op scheduling is also 
roughly 50/50.
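
Not part of the VTune work itself, but as a rough cross-check on the Python side, here is a minimal `timeit` sketch (assuming the current ctypes-based frontend; the exact numbers of course depend on machine and build) for measuring the end-to-end per-call latency that the C++ breakdown above sits inside:

```python
import timeit

import mxnet as mx


def ns_per_call(stmt, number=100000):
    """Average wall-clock time per call of `stmt`, in nanoseconds."""
    total = timeit.timeit(stmt, globals=globals(), number=number)
    return total / number * 1e9


if __name__ == '__main__':
    # Context construction only; no engine work involved.
    print('mx.cpu()          : %.0f ns' % ns_per_call('mx.cpu()'))
    # A tiny op call: argument handling, the ctypes C API crossing, and the
    # asynchronous dispatch into the engine (the kernel itself runs async).
    print('mx.nd.zeros((1,)) : %.0f ns' % ns_per_call('mx.nd.zeros((1,))'))
```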

For the "fast path" data structures, I'm summarizing the items as follows 
(including the ones suggested by @sxjscience):

- `tuple` and `list`, since they are used interchangeably in NumPy semantics to 
represent shapes and axes.
- `str`, because `einsum` takes its subscripts as a string and the op can be used 
intensively in transformer models.
- `py_slice`, `Ellipsis`, and `None` for basic indexing. We can go one step further 
by moving the whole indexing dispatch logic to the backend.
- NumPy scalars.
- `mx.context.Context`. A single call to `mx.cpu()` can cost as much as 600 ns 
with ctypes. One thought is to do it the pybind11 way by creating a Python binding 
for the backend `Context` class.
- `np.dtype`. Similar to `Context`.
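
For concreteness, a small illustrative sketch of where these types show up at the call boundary in the numpy frontend (the keyword names such as `ctx`/`dtype`, and the exact op set, are my assumptions and may differ slightly from the current API):

```python
import numpy as onp

import mxnet as mx
from mxnet import np, npx

npx.set_np()

ctx = mx.cpu()                                     # mx.context.Context
x = np.zeros((2, 3), dtype=onp.float32, ctx=ctx)   # tuple shape, np.dtype, Context
y = np.transpose(x, axes=[1, 0])                   # list for axes
z = np.einsum('ij,jk->ik', y, x)                   # str subscripts for einsum
w = x[0:1, ..., None]                              # py_slice, Ellipsis, None (basic indexing)
s = x * onp.float32(2.0)                           # NumPy scalar operand
```

These are the arguments that cross the Python/C++ boundary on every call, which is why they are the candidates for a dedicated fast path.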

