reminisce commented on a change in pull request #8566: optimize broadcast
URL: https://github.com/apache/incubator-mxnet/pull/8566#discussion_r149839401
 
 

 ##########
 File path: src/operator/mxnet_op.h
 ##########
 @@ -345,6 +394,13 @@ __global__ void mxnet_generic_kernel(int N, Args... args) {
    }
  }
 
+template<typename OP, typename ...Args>
+__global__ void mxnet_generic_kernel_ex(int N, Args... args) {
+  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < N; i += blockDim.x * gridDim.x) {
+    OP::Map(i, 1, args...);
 
 Review comment:
   In the case of the slice op, I benchmarked two different approaches on GPU: one where each thread works on a single element (with frequent unravel and ravel calls, as here with length = 1), and another similar to what you did for CPU. It turns out that the latter is slightly faster. The slice op's output had about 50,000 elements.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services
