masahi edited a comment on pull request #7233:
URL: https://github.com/apache/tvm/pull/7233#issuecomment-757006403


   The second text block is an excerpt from the output of `nvprof 
--print-gpu-trace`, showing the elapsed time, launch config, etc. of each kernel 
executed, in order. The first line is for the initialization kernel, the second 
for the actual scatter kernel.
   
   I don't have benchmarks other than the data from MaskRCNN. For the first 
kernel of 4D scatter, since it is just a memcpy, I don't see why we should do 
threading differently than other injective ops. I hope we don't need thorough 
benchmarking to justify this change. After this change, the trace becomes the 
following (only the first line changes; note the elapsed time and thread launch 
config).
   
   ```
   31.2518s  495.68us          (12250 1 1)      (1024 1 1)         8        0B        0B         -           -           -           -  GeForce GTX 107         1         7  fused_scatter_1_kernel0 [2980]
   31.2523s  522.78us            (1 256 7)        (32 1 1)        16        0B        0B         -           -           -           -  GeForce GTX 107         1         7  fused_scatter_1_kernel1 [2982]
   ```
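   For reference, the two-kernel structure discussed above can be sketched in NumPy terms. This is a hypothetical analogue of the scatter semantics, not the TVM CUDA implementation; the function name and signature are illustrative only:
   
   ```python
   import numpy as np
   
   def scatter_sketch(data, indices, updates, axis=0):
       # Phase 1: the initialization kernel. This is just an elementwise
       # copy of `data` into the output buffer, which is why it can be
       # threaded like any other injective op.
       out = data.copy()
       # Phase 2: the actual scatter kernel. Write each element of
       # `updates` at the position given by `indices` along `axis`.
       for idx in np.ndindex(indices.shape):
           pos = list(idx)
           pos[axis] = indices[idx]
           out[tuple(pos)] = updates[idx]
       return out
   ```
   
   In this framing, only phase 1 is touched by the change: its launch config moves to the standard injective-op threading, while the phase 2 kernel is left as-is.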
   
   > Would it be a better idea to have to separate scatter implementations (the 
parallel one and the sequential one) and let autotvm figure out which is 
better? Then we don't have to have all this special casing and magic input 
sizes.
   
   Hmm, this sounds better than picking an arbitrary threshold, but do we have 
existing uses of autotvm for making such a decision? Given that the scatter 
kernels are extern, I'm not sure autotvm can work with them.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

