anko-intel commented on issue #17971: URL: https://github.com/apache/incubator-mxnet/issues/17971#issuecomment-633696597
Hi @djaym7, thank you for your results. I see similar results when measuring locally on a Skylake-X i9-7920X with the MXNet 1.6.0 cu102mkl binary. The only exception is the time for the 512x512 tensor on MXNet(?). MXNet compiled from the master branch (at b2144777b - fix (#18313)) uses MKL if available, and the results are much better. But MXNet is still slower than NumPy for smaller tensors.

Additional measurements on master with the MXNet Profiler enabled show that more than 80 us separates the time measured in Python from the time the Profiler reports for the dot operation. This seems to be the already known issue of the cost of crossing the Python/C++ boundary, tracked in #14883 and #17097. To me it sounds like fixing the Python-MXNet binding overhead should also fix this issue.

The results in the table below show that, neglecting measurement noise, the difference between the time measured in Python and the MKL time is almost the same as the difference between the Python time and the MXNet Profiler time, which confirms the Python <-> C++ API issue.

The last table contains results for MXNet with both the Profiler and MKL verbose mode enabled (each adds extra time to the measurements). Here the difference between the Python time and the Profiler time is similar to the results in the previous tables, and it is the most significant component.

The exact results of my measurements can be found in the logs: [dot_issue_logs.zip](https://github.com/apache/incubator-mxnet/files/4678496/dot_issue_logs.zip)
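
For anyone who wants to reproduce the Python-side numbers, here is a minimal timing sketch. It is not the script used for the measurements above. The helper names `time_call_us` and `binding_overhead_us`, the repeat counts, and all tensor sizes other than 512 are my own assumptions. It times a callable from Python and subtracts a backend figure (for example one read from the Profiler or MKL verbose output) to estimate the per-call overhead.

```python
import statistics
import time


def time_call_us(fn, repeats=200, warmup=10):
    """Return the median wall-clock time of fn() in microseconds.

    Hypothetical helper; repeats and warmup are arbitrary choices.
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e6)
    return statistics.median(samples)


def binding_overhead_us(python_side_us, backend_us):
    """Python-side time not accounted for by the backend figure
    (taken from the Profiler or MKL verbose output)."""
    return python_side_us - backend_us


if __name__ == "__main__":
    # Both libraries are optional so the sketch runs even if one is missing.
    try:
        import numpy as np
    except ImportError:
        np = None
    try:
        import mxnet as mx
    except ImportError:
        mx = None

    # Sizes other than 512 are assumptions chosen for illustration.
    for n in (4, 32, 128, 512):
        if np is not None:
            a = np.random.rand(n, n).astype("float32")
            t_np = time_call_us(lambda: np.dot(a, a))
            print(f"numpy {n}x{n}: {t_np:.1f} us")
        if mx is not None:
            b = mx.nd.random.uniform(shape=(n, n))
            # wait_to_read() blocks until the asynchronous engine has
            # finished, so the whole operation is timed, not just the enqueue.
            t_mx = time_call_us(lambda: mx.nd.dot(b, b).wait_to_read())
            print(f"mxnet {n}x{n}: {t_mx:.1f} us")
```

The median is used instead of the mean so that occasional scheduling spikes do not dominate the small-tensor timings, where the fixed per-call cost is the quantity of interest.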
