[GitHub] [incubator-tvm] yongfeng-nv edited a comment on issue #5367: Improve IntervalSet's floormod

GitBox Mon, 20 Apr 2020 22:38:40 -0700


yongfeng-nv edited a comment on issue #5367:
URL: https://github.com/apache/incubator-tvm/pull/5367#issuecomment-616035227



   The current behavior of IntervalSet floormod(a, b) is rough -- it returns 
[0, b-1], [-b+1, b-1], or everything.  It causes extra iterations to my 
schedule.  This is a simplified version:
   
       import tvm
       from tvm import te
       
       def test_bound_tile_mod():
           def compute(M_tiles, N_tiles, factor, dtype):
               # Algo
               M = M_tiles * factor
               N = N_tiles * factor
       
               A = tvm.te.placeholder((N, M), name='A', dtype=dtype)
               C = tvm.te.compute((N, M), lambda n, m: A[n, m], name='C')
               s = tvm.te.create_schedule(C.op)
       
               return s, A, C
       
           def schedule(s, factor, padding, A, C):
               C_local = s.cache_write(C, "local")
       
               n, m = C.op.axis
               bn, bm, ni, mi = s[C].tile(n, m, factor, factor)
               nio, nii = s[C].split(ni, 2)
               n = s[C].fuse(nii, mi)
               C_shared = s.cache_write(C, "shared")
               bn, bm, ni, mi = C_shared.op.axis       
               s[C_shared].storage_align(ni, factor * 2, padding)
       
               n, m = s[C].op.axis
               bn, bm, ni, mi = s[C].tile(n, m, factor, factor)
               s[C].set_scope("global")
               niio, niii = s[C].split(ni, 32)
               s[C_shared].compute_at(s[C], niio)
       
               return s
       
           s, A, C = compute(2, 2, 128, "float16")
           s = schedule(s, 128, 8, A, C)
           bounds = tvm.te.schedule.InferBound(s)
           print(tvm.lower(s, [A, C], simple_mode=True))
   
   It does 256x256 point-wise copying in 128x128 tiles from local to shared 
then to global.  I would like to allocate only a 32x128 shared memory and reuse 
it four times per tile.  In addition, I want to pad 8 data every two tile rows 
to avoid bank conflicts in shared memory with storage_align.  I expect C.shared 
(the local ->shared stage) does exactly 256x256 copying, but the following IR 
shows 4 times of that.
   
       // attr [C.local] storage_scope = "local"
       allocate C.local[float16 * 65536]
       // attr [C.shared] storage_scope = "shared"
       allocate C.shared[float16 * 16896]
       for (n.c, 0, 256) {
         for (m.c, 0, 256) {
           C.local[((n.c*256) + m.c)] = A[((n.c*256) + m.c)]
         }
       }
       for (n.outer, 0, 2) {
         for (m.outer, 0, 2) {
           for (n.inner.outer, 0, 4) {
             for (n.inner.outer.c, 0, 64) {
               for (n.inner.inner.m.inner.fused.c, 0, 256) {
                 C.shared[((n.inner.outer.c*264) + 
n.inner.inner.m.inner.fused.c)] = C.local[(((((n.outer*32768) + 
(n.inner.outer.c*512)) + (floordiv(n.inner.inner.m.inner.fused.c, 128)*256)) + 
(m.outer*128)) + floormod(n.inner.inner.m.inner.fused.c, 128))]
               }
             }
             for (n.inner.inner, 0, 32) {
               for (m.inner, 0, 128) {
                 C[(((((n.outer*32768) + (n.inner.outer*8192)) + 
(n.inner.inner*256)) + (m.outer*128)) + m.inner)] = 
C.shared[((((n.inner.outer*4224) + (floordiv(n.inner.inner, 2)*264)) + 
(floormod(n.inner.inner, 2)*128)) + m.inner)]
               }
             }
           }
         }
       }
               
   If there is another schedule to achieve this, please let me know.  As of 
now, I can only accomplish this by tiling both C.shared and C.  C's compute now 
involves several floormod: `C.shared(floordiv(n, 128), floordiv(m, 128), 
floordiv(floormod(n, 128), 2), ((floormod(floormod(n, 128), 2)*128) + 
floormod(m, 128)))`.  `m` and `n` further bring in leaf IterVars to floormod to 
PropBoundToInputs() during InferBound.  Because the best result from 
floormod(x, 128) is [0, 128-1], C.shared doesn't reduce domain, although it 
does compute_at s[C]'s n.inner.outer with range [0,3].
   
   With this PR, I am able to get expected IR -- `(n.inner.outer.c, 0, 64)` 
becoming `(n.inner.outer.c, 0, 16)`:
   
       // attr [C.local] storage_scope = "local"
       allocate C.local[float16 * 65536]
       // attr [C.shared] storage_scope = "shared"
       allocate C.shared[float16 * 4224]
       for (n.c, 0, 256) {
         for (m.c, 0, 256) {
           C.local[((n.c*256) + m.c)] = A[((n.c*256) + m.c)]
         }
       }
       for (n.outer, 0, 2) {
         for (m.outer, 0, 2) {
           for (n.inner.outer, 0, 4) {
             for (n.inner.outer.c, 0, 16) {
               for (n.inner.inner.m.inner.fused.c, 0, 256) {
                 C.shared[((n.inner.outer.c*264) + 
n.inner.inner.m.inner.fused.c)] = C.local[((((((n.outer*32768) + 
(n.inner.outer*8192)) + (n.inner.outer.c*512)) + 
(floordiv(n.inner.inner.m.inner.fused.c, 128)*256)) + (m.outer*128)) + 
floormod(n.inner.inner.m.inner.fused.c, 128))]
               }
             }
             for (n.inner.inner, 0, 32) {
               for (m.inner, 0, 128) {
                 C[(((((n.outer*32768) + (n.inner.outer*8192)) + 
(n.inner.inner*256)) + (m.outer*128)) + m.inner)] = 
C.shared[(((floordiv(n.inner.inner, 2)*264) + (floormod(n.inner.inner, 2)*128)) 
+ m.inner)]
               }
             }
           }
         }
       }
   
   
   The improvement has two folds:
   1.   EvalSet takes a range map as input to help calculating floormod.  For 
example:  `[x*4, x*4+3] % 8 = [x*4-x/2*8, x*4+3-(x*4+3)/8*8)]`, if we don't 
know `x`'s range; but `[x*4, x*4+3]`, if we know `x`'s range is `[0, 1]`.  This 
is common from tiling.
         
   2.   For `a mod b`, if b is an IntImm, single point, and greater than 0, we 
do the following:
   
       // a mod b = a - b * (a/b) if
       // (a) a_max - a_min < b, i.e. that before mod, a's range doesn't cover 
[0, b)
       // and (b) a_min mod b <= a_max mod b, i.e. that a's range is still 
continuous after mod
   so that `[13, 15] % 10 = [3, 5]`
   
   IntervalSet evaluation is hard.  These improvements are adhoc to the 
problems I encountered.  I am open to any alternatives.
   
   Question: how about mod?  It's implementation looks same as floormod.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [incubator-tvm] yongfeng-nv edited a comment on issue #5367: Improve IntervalSet's floormod

Reply via email to