masahi commented on pull request #7233:
URL: https://github.com/apache/tvm/pull/7233#issuecomment-757006403


   The second text block is an excerpt from the output of `nvprof 
--print-gpu-trace`, showing the elapsed time, launch config, etc. of each 
kernel executed, in order.
   
   I don't have any benchmarks other than the data from MaskRCNN. For the 
first kernel of the 4D scatter, since it is just a memcpy, I don't see why we 
should thread it differently than other injective ops. I hope we don't need 
thorough benchmarking to justify this change.
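   To make the point concrete, here is an illustrative NumPy reference for 
scatter (not the TVM kernel itself; `scatter_ref` and the 2D shapes are 
hypothetical for brevity). The first phase is a plain element-wise copy, 
which is why it can be threaded like any injective op; only the second 
phase has potentially colliding writes:

   ```python
   import numpy as np

   def scatter_ref(data, indices, updates, axis=0):
       """Reference scatter: copy `data` into `out`, then write `updates`
       at positions given by `indices` along `axis`."""
       # Phase 1: element-wise copy -- embarrassingly parallel,
       # no different from any other injective op.
       out = data.copy()
       # Phase 2: apply the updates. Writes along `axis` may collide,
       # so this is the phase where parallelization needs care.
       for idx in np.ndindex(indices.shape):
           pos = list(idx)
           pos[axis] = indices[idx]
           out[tuple(pos)] = updates[idx]
       return out

   data = np.zeros((2, 3), dtype="float32")
   indices = np.array([[1, 0, 2], [0, 2, 1]])
   updates = np.arange(6, dtype="float32").reshape(2, 3)
   print(scatter_ref(data, indices, updates, axis=1))
   ```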
   
   > Would it be a better idea to have two separate scatter implementations 
(the parallel one and the sequential one) and let autotvm figure out which is 
better? Then we don't have to have all this special casing and magic input 
sizes.
   
   hmm, this sounds better than picking a random threshold, but do we have 
existing uses of autotvm to make such a decision? Given that scatter kernels 
are extern, I'm not sure if autotvm can work with them.

