lazycal opened a new issue #8540:
URL: https://github.com/apache/tvm/issues/8540


   TVM seems to have an issue with the alignment requirement when doing vectorization on CUDA. Script to reproduce:
   ```python
   import tvm
   from tvm import relay
   import numpy as np
   from tvm.contrib import graph_executor
   
   
   def gen_one_random_feed(module):
       feed = {}
       for i in range(module.get_num_inputs()):
           data = np.random.rand(*module.get_input(i).shape)
           feed[i] = data
       return feed
   
   
   def run(mod):
       target = "cuda"
       lib = relay.build(mod, target=target)
       dev = tvm.cuda()
       rt_mod = graph_executor.GraphModule(lib["default"](dev))
       input_dict = gen_one_random_feed(rt_mod)
       for k, v in input_dict.items():
           rt_mod.set_input(k, v)
       rt_mod.run()
   
   
   def get_mod():
       x = relay.var('x', shape=(1, 3,), dtype='float16')
       t1 = relay.strided_slice(x, [0, 1], [1, 3], [1, 1])
       func = relay.Function([x], t1)
       return tvm.IRModule.from_expr(func)
   
   
   mod = get_mod()
   run(mod)
   print('Passed!')
   ```
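   
   To see the offending kernel, the CUDA source can be dumped from the build artifact. A minimal sketch, assuming the `mod` returned by `get_mod()` above and that the CUDA module is the first imported module of the built library:
   ```python
   # Sketch: print the generated CUDA source for inspection.
   lib = relay.build(get_mod(), target="cuda")
   cuda_src = lib.get_lib().imported_modules[0].get_source()
   print(cuda_src)
   ```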
   
   The generated CUDA kernel code:
   ```cuda
   extern "C" __global__ void tvmgen_default_fused_strided_slice_kernel0(half* 
__restrict__ T_strided_slice, half* __restrict__ placeholder) {
     if (((int)threadIdx.x) < 1) {
       ((uint1*)(T_strided_slice + ((((int)threadIdx.x) * 2))))[0] = 
((uint1*)(placeholder + (((((int)threadIdx.x) * 3) + 1))))[0];
     }
   }
   ```
   It reads two `half`s starting from an odd element offset into `placeholder`, i.e. an address that is not aligned to 4 bytes (assuming `placeholder` itself is at least 4-byte aligned), even though the `uint1` access requires 4-byte alignment.
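   
   As a quick sanity check on the byte arithmetic (illustration only, not TVM code):
   ```python
   # The failing load covers 2 halfs (4 bytes) starting at element offset 1.
   half_bytes = 2                            # sizeof(half)
   byte_offset = 1 * half_bytes              # = 2 bytes into the buffer
   vector_bytes = 2 * half_bytes             # uint1 load = 4 bytes
   assert byte_offset % vector_bytes != 0    # 2 is not a multiple of 4 -> misaligned
   ```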
   
   ## Some analysis and possible fix
   After a closer look, I found that this is caused by the vectorization introduced here: https://github.com/apache/tvm/blob/main/python/tvm/topi/cuda/injective.py#L55. Setting `vector_width=1` avoids the issue. I also managed to craft a TE compute and schedule that triggers the bug, for better illustration:
   
   ```python
   import tvm
   from tvm import te
   import numpy as np
   
   def get_te_sch():
       N = 3
       C_N = N - 1
       A = te.placeholder((N,), name="A", dtype='float16')
       C = te.compute((C_N,), lambda i: A[i+1], name="C")
   
       s = te.create_schedule(C.op)
       oi, ii = s[C].split(C.op.axis[0], factor=2)
       s[C].bind(oi, te.thread_axis("threadIdx.x"))
       s[C].vectorize(ii) # BUG: misalignment
       return N, C_N, s, A, C
   
   def build_run():
       N, C_N, s, A, C = get_te_sch()
   
       tgt = tvm.target.Target(target="cuda", host="llvm")
       foo = tvm.build(s, [A, C], tgt, name="foo")
       dev = tvm.device(tgt.kind.name, 0)
   
       a_data = np.arange(0, N).astype(A.dtype)
       a = tvm.nd.array(a_data, dev)
       c = tvm.nd.array(np.zeros(C_N, dtype=C.dtype), dev)
       foo(a, c)
       expected = a_data[1:C_N+1]
       assert np.allclose(c.numpy(), expected), f"expected={expected}\nactual={c}"
   
   build_run()
   print('passed')
   ```
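   
   The lowered TIR quoted below can be printed with something along these lines (a sketch reusing the schedule from `get_te_sch()` above):
   ```python
   # Sketch: inspect the IR after scheduling, before codegen.
   N, C_N, s, A, C = get_te_sch()
   print(tvm.lower(s, [A, C], simple_mode=True))
   ```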
   Notice that the compute of `C` is a vector load starting from an address that is not 4-byte aligned. This gets lowered into
   ```cpp
     attr = {"global_symbol": "foo", "tir.noalias": True}
     buffers = {C: Buffer(C_2: Pointer(float16), float16, [2], []),
                A: Buffer(A_2: Pointer(float16), float16, [3], [])}
     buffer_map = {A_1: A, C_1: C} {
     realize(C, [0:2], True {
    attr [IterVar(threadIdx.x: int32, (nullptr), "ThreadIndex", "threadIdx.x")] "thread_extent" = 1;
       for (i.inner: int32, 0, 2) "vectorized" {
         C[i.inner] = A[(i.inner + 1)]
       }
     })
   }
   ```
   followed by the `VectorizeLoop` pass, which vectorizes the loop into ``C_2[ramp(0, 1, 2)] = (float16x2*)A_2[ramp(1, 1, 2)]``; this is then codegen-ed into a misaligned vector load of `A_2[1:3]`.
   
   I'm not sure which pass should handle the alignment; I suspect it should be the responsibility of codegen. At [CodeGenLLVM](https://github.com/apache/tvm/blob/07243a89df3cdac6afaf29a74aba398190a76c72/src/target/llvm/codegen_llvm.cc#L1167) I do see alignment handling, but I don't see it in [CodeGenCUDA's visit of LoadNode](https://github.com/apache/tvm/blob/07243a89df3cdac6afaf29a74aba398190a76c72/src/target/source/codegen_c.cc#L701). I guess this [if branch](https://github.com/apache/tvm/blob/07243a89df3cdac6afaf29a74aba398190a76c72/src/target/source/codegen_c.cc#L711-L713) is what generates the vector load, so maybe a fix would be to add an alignment check to its condition? The else branch does not generate a vector load, so disabling the if branch also works fine in my case.
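   
   To make that suggestion concrete, the kind of check I have in mind would look roughly like this (a hypothetical helper sketched in Python, not existing TVM code; the names and the assumption that the buffer itself is suitably aligned are mine):
   ```python
   # Hypothetical sketch: a vector load of `lanes` elements at element offset
   # `offset_elems` is only safe if the byte offset is a multiple of the vector
   # width, assuming the underlying buffer is at least vector-width aligned.
   def vector_load_is_aligned(offset_elems: int, lanes: int, dtype_bytes: int) -> bool:
       vector_bytes = lanes * dtype_bytes
       return (offset_elems * dtype_bytes) % vector_bytes == 0

   # The failing case above: 2 x float16 starting at element offset 1.
   assert not vector_load_is_aligned(offset_elems=1, lanes=2, dtype_bytes=2)
   ```
   In the real codegen the offset is an arbitrary expression rather than a constant, so such a check would presumably have to be conservative and fall back to the scalar else branch whenever alignment cannot be proven.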
   
   
   ## Environment
   - TVM: commit 07243a89df3cdac6afaf29a74aba398190a76c72
   - CUDA version: 10.0
   - System: Ubuntu 16.04
   - GCC 5.4
   - Build options: ``-DUSE_RELAY_DEBUG=ON -DUSE_CUBLAS=ON -DUSE_LLVM=ON 
-DUSE_CUDA=ON``
   
   

