[GitHub] [incubator-tvm] boh-inspur commented on a change in pull request #5428: [CODEGEN][CUDA] Fix a bug when vectorized load&store was involved for…

GitBox Tue, 28 Apr 2020 02:11:55 -0700


boh-inspur commented on a change in pull request #5428:
URL: https://github.com/apache/incubator-tvm/pull/5428#discussion_r416455958




##########
File path: src/target/source/codegen_cuda.cc
##########
@@ -274,9 +274,21 @@ void CodeGenCUDA::PrintVecElemLoad(
   static const char access[] = {'x', 'y', 'z', 'w'};
   CHECK(i >= 0 && i < (t.is_float16() ? 8 : 4));
   if ((t.is_int()) && t.bits() == 8) {
-    os << "((char)(" << vec << " >> " << i * 8 << "))";
+    if (t.lanes() == 1) {
+      os << vec;
+    } else if (t.lanes() == 2) {
+      os << vec << "." << access[i % 2];
+    } else {
+      os << "((char)(" << vec << " >> " << i * 8 << "))";
+    }
   } else if ((t.is_uint()) && t.bits() == 8) {
-    os << "((unsigned char)(" << vec << " >> " << i * 8 << "))";
+    if (t.lanes() == 1) {

Review comment:
       @wpan11nv, I've implemented the type using `int`, and something goes 
wrong.
   When the type is int8*2, the generated CUDA code looks correct and also 
builds correctly.
   `
   #if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 610)
   
   #include <sm_61_intrinsics.h>
   #endif
   extern "C" __global__ void default_function_kernel0(void* __restrict__ B, void* __restrict__ A) {
     int _1;
     {
       int _2 = (( int*)(( signed char*)A + (((((int)blockIdx.x) * 16) + (((int)threadIdx.x) * 2)))))[0];
       int _3 = (int)16843009;
       _1=((((char)(_2 >> 0))+((char)(_3 >> 0))) << 0);
       _1=_1 & ~(0x000000ff << 8) |((((char)(_2 >> 8))+((char)(_3 >> 8))) << 8);
     }
     (( int*)(( signed char*)B + (((((int)blockIdx.x) * 16) + (((int)threadIdx.x) * 2)))))[0] = _1;
   }
   `
   But there is a runtime error when copying from GPU to CPU memory: Check 
failed: e == cudaSuccess || e == cudaErrorCudartUnloading: misaligned address.  
   Do you have any advice?
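
   One possible explanation, offered as an assumption rather than a confirmed 
diagnosis: the kernel casts the byte offset `blockIdx.x * 16 + threadIdx.x * 2` 
to `int*`, and a 4-byte load in CUDA needs a 4-byte-aligned address. The 
host-side C++ sketch below only reproduces that offset arithmetic. The helper 
names are hypothetical, and it assumes that A itself is at least 4-byte aligned 
and that each block runs 8 threads (16 bytes per block / 2 bytes per thread).

```cpp
#include <vector>

// Byte offset into A/B that the generated kernel computes for one thread.
int byte_offset(int block_idx, int thread_idx) {
  return block_idx * 16 + thread_idx * 2;
}

// True when an access of `access_bytes` bytes at `offset` is naturally aligned,
// assuming the base pointer is itself aligned to `access_bytes`.
bool is_aligned(int offset, int access_bytes) {
  return offset % access_bytes == 0;
}

// Thread indices in one block whose int-sized (4-byte) access lands on an
// offset that is not a multiple of 4.
std::vector<int> misaligned_threads(int block_idx, int threads_per_block) {
  std::vector<int> result;
  for (int t = 0; t < threads_per_block; ++t) {
    if (!is_aligned(byte_offset(block_idx, t), 4)) {
      result.push_back(t);
    }
  }
  return result;
}
```

   Under these assumptions every odd `threadIdx.x` gets an offset that is only 
2-byte aligned, which would be consistent with the reported error. A 2-byte 
access at the same offsets would be naturally aligned.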
   
   Also, if we use int32_t when the type is int8*2 and the tensor is very 
large, it wastes a lot of memory unnecessarily, and runtime resources matter 
more. So I think that, based on the current code, only a small modification is 
needed to support int8*3 with char3, if char3 support is required. What is 
your opinion?
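
   To make the size argument concrete, here is a small host-side sketch. 
`Char2` and `Char3` are hypothetical stand-ins that mirror the member layout of 
CUDA's `char2` and `char3`; they are not the CUDA types themselves, and 
`storage_bytes` is an illustrative helper. Whether the generated code really 
stores one 32-bit value per int8*2 vector in memory is not established in this 
thread.

```cpp
#include <cstddef>
#include <cstdint>

// Stand-ins for CUDA's char2 / char3 layout (assumption: same members).
struct alignas(2) Char2 { signed char x, y; };
struct Char3 { signed char x, y, z; };

static_assert(sizeof(Char2) == 2, "two int8 lanes packed into 2 bytes");
static_assert(sizeof(Char3) == 3, "three int8 lanes packed into 3 bytes");

// Bytes needed to hold `num_vectors` values of the given storage type.
template <typename Storage>
constexpr std::size_t storage_bytes(std::size_t num_vectors) {
  return num_vectors * sizeof(Storage);
}
```

   For the same number of int8*2 vectors, `storage_bytes<std::int32_t>` is 
twice `storage_bytes<Char2>`, which is the overhead described above.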




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
