masahi edited a comment on pull request #7233:
URL: https://github.com/apache/tvm/pull/7233#issuecomment-757006403
The second text block is an excerpt from the output of `nvprof
--print-gpu-trace`, showing the elapsed time, launch configuration, etc. of
each kernel executed, in order. The first line is for the initialization
kernel, the second for the actual scatter kernel.
I don't have any benchmarks other than the data from MaskRCNN. For the
first kernel of the 4D scatter, since it is just a memcpy, I don't see why we
should thread it differently than other injective ops. I hope we don't need
thorough benchmarking to justify this change. After this change, the trace
becomes
```
31.2518s  495.68us  (12250 1 1)  (1024 1 1)   8  0B  0B  -  -  -  -  GeForce GTX 107  1  7  fused_scatter_1_kernel0 [2980]
31.2523s  522.78us  (1 256 7)    (32 1 1)    16  0B  0B  -  -  -  -  GeForce GTX 107  1  7  fused_scatter_1_kernel1 [2982]
```
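For reference, the two-phase behavior visible in the trace can be sketched in plain NumPy (hypothetical helper name; this is a semantic sketch, not the TVM kernel): phase 1 is the elementwise copy of the input into the output (the memcpy-like init kernel, fully parallel), and phase 2 applies the updates along the scatter axis (the second kernel).

```python
import numpy as np

def scatter_along_axis(data, indices, updates, axis=0):
    """Sketch of scatter semantics as two phases:
    phase 1 copies data to out (the init/memcpy kernel),
    phase 2 writes updates redirected along `axis`
    (the actual scatter kernel)."""
    out = data.copy()                    # phase 1: parallel elementwise copy
    # phase 2: visit every position in `indices`
    for idx in np.ndindex(indices.shape):
        dst = list(idx)
        dst[axis] = indices[idx]         # redirect coordinate on the scatter axis
        out[tuple(dst)] = updates[idx]
    return out
```

Phase 1 touches every output element independently, which is why it can use the same dense threading as any injective op, while phase 2 must serialize writes along the scatter axis to keep deterministic results.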
> Would it be a better idea to have two separate scatter implementations (the
> parallel one and the sequential one) and let autotvm figure out which is
> better? Then we don't have to have all this special casing and magic input
> sizes.
Hmm, this sounds better than picking an arbitrary threshold, but do we have
existing uses of autotvm to make such a decision? Given that the scatter
kernels are extern, I'm not sure autotvm can work with them.
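The selection being discussed amounts to empirically timing both implementations and keeping the faster one. A minimal sketch of that idea (hypothetical helper, not the actual autotvm API) might look like:

```python
import time

def pick_faster(impls, args, repeats=3):
    """Time each candidate implementation on the given args and
    return the name of the fastest one (crude tuning sketch)."""
    best_name, best_time = None, float("inf")
    for name, fn in impls.items():
        start = time.perf_counter()
        for _ in range(repeats):
            fn(*args)
        elapsed = (time.perf_counter() - start) / repeats
        if elapsed < best_time:
            best_name, best_time = name, elapsed
    return best_name
```

autotvm does essentially this over a schedule search space with on-device measurement, but as noted above it is unclear whether extern kernels expose anything for it to tune.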
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]