yongfeng-nv edited a comment on issue #5367:
URL: https://github.com/apache/incubator-tvm/pull/5367#issuecomment-616035227
The current behavior of IntervalSet floormod(a, b) is rough -- it returns
[0, b-1], [-b+1, b-1], or everything. It causes extra iterations to my
schedule. This is a simplified version:
import tvm
from tvm import te
def test_bound_tile_mod():
def compute(M_tiles, N_tiles, factor, dtype):
# Algo
M = M_tiles * factor
N = N_tiles * factor
A = tvm.te.placeholder((N, M), name='A', dtype=dtype)
C = tvm.te.compute((N, M), lambda n, m: A[n, m], name='C')
s = tvm.te.create_schedule(C.op)
return s, A, C
def schedule(s, factor, padding, A, C):
C_local = s.cache_write(C, "local")
n, m = C.op.axis
bn, bm, ni, mi = s[C].tile(n, m, factor, factor)
nio, nii = s[C].split(ni, 2)
n = s[C].fuse(nii, mi)
C_shared = s.cache_write(C, "shared")
bn, bm, ni, mi = C_shared.op.axis
s[C_shared].storage_align(ni, factor * 2, padding)
n, m = s[C].op.axis
bn, bm, ni, mi = s[C].tile(n, m, factor, factor)
s[C].set_scope("global")
niio, niii = s[C].split(ni, 32)
s[C_shared].compute_at(s[C], niio)
return s
s, A, C = compute(2, 2, 128, "float16")
s = schedule(s, 128, 8, A, C)
bounds = tvm.te.schedule.InferBound(s)
print(tvm.lower(s, [A, C], simple_mode=True))
It does 256x256 point-wise copying in 128x128 tiles from local to shared
then to global. I would like to allocate only a 32x128 shared memory and reuse
it four times per tile. In addition, I want to pad 8 data every two tile rows
to avoid bank conflicts in shared memory with storage_align. I expect C.shared
(the local ->shared stage) does exactly 256x256 copying, but the following IR
shows 4 times of that.
// attr [C.local] storage_scope = "local"
allocate C.local[float16 * 65536]
// attr [C.shared] storage_scope = "shared"
allocate C.shared[float16 * 16896]
for (n.c, 0, 256) {
for (m.c, 0, 256) {
C.local[((n.c*256) + m.c)] = A[((n.c*256) + m.c)]
}
}
for (n.outer, 0, 2) {
for (m.outer, 0, 2) {
for (n.inner.outer, 0, 4) {
for (n.inner.outer.c, 0, 64) {
for (n.inner.inner.m.inner.fused.c, 0, 256) {
C.shared[((n.inner.outer.c*264) +
n.inner.inner.m.inner.fused.c)] = C.local[(((((n.outer*32768) +
(n.inner.outer.c*512)) + (floordiv(n.inner.inner.m.inner.fused.c, 128)*256)) +
(m.outer*128)) + floormod(n.inner.inner.m.inner.fused.c, 128))]
}
}
for (n.inner.inner, 0, 32) {
for (m.inner, 0, 128) {
C[(((((n.outer*32768) + (n.inner.outer*8192)) +
(n.inner.inner*256)) + (m.outer*128)) + m.inner)] =
C.shared[((((n.inner.outer*4224) + (floordiv(n.inner.inner, 2)*264)) +
(floormod(n.inner.inner, 2)*128)) + m.inner)]
}
}
}
}
}
If there is another schedule to achieve this, please let me know. As of
now, I can only accomplish this by tiling both C.shared and C. C's compute now
involves several floormod: `C.shared(floordiv(n, 128), floordiv(m, 128),
floordiv(floormod(n, 128), 2), ((floormod(floormod(n, 128), 2)*128) +
floormod(m, 128)))`. `m` and `n` further bring in leaf IterVars to floormod to
PropBoundToInputs() during InferBound. Because the best result from
floormod(x, 128) is [0, 128-1], C.shared doesn't reduce domain, although it
does compute_at s[C]'s n.inner.outer with range [0,3].
With this PR, I am able to get expected IR -- `(n.inner.outer.c, 0, 64)`
becoming `(n.inner.outer.c, 0, 16)`:
// attr [C.local] storage_scope = "local"
allocate C.local[float16 * 65536]
// attr [C.shared] storage_scope = "shared"
allocate C.shared[float16 * 4224]
for (n.c, 0, 256) {
for (m.c, 0, 256) {
C.local[((n.c*256) + m.c)] = A[((n.c*256) + m.c)]
}
}
for (n.outer, 0, 2) {
for (m.outer, 0, 2) {
for (n.inner.outer, 0, 4) {
for (n.inner.outer.c, 0, 16) {
for (n.inner.inner.m.inner.fused.c, 0, 256) {
C.shared[((n.inner.outer.c*264) +
n.inner.inner.m.inner.fused.c)] = C.local[((((((n.outer*32768) +
(n.inner.outer*8192)) + (n.inner.outer.c*512)) +
(floordiv(n.inner.inner.m.inner.fused.c, 128)*256)) + (m.outer*128)) +
floormod(n.inner.inner.m.inner.fused.c, 128))]
}
}
for (n.inner.inner, 0, 32) {
for (m.inner, 0, 128) {
C[(((((n.outer*32768) + (n.inner.outer*8192)) +
(n.inner.inner*256)) + (m.outer*128)) + m.inner)] =
C.shared[(((floordiv(n.inner.inner, 2)*264) + (floormod(n.inner.inner, 2)*128))
+ m.inner)]
}
}
}
}
}
The improvement has two folds:
1. EvalSet takes a range map as input to help calculating floormod. For
example: `[x*4, x*4+3] % 8 = [x*4-x/2*8, x*4+3-(x*4+3)/8*8)]`, if we don't
know `x`'s range; but `[x*4, x*4+3]`, if we know `x`'s range is `[0, 1]`. This
is common from tiling.
2. For `a mod b`, if b is an IntImm, single point, and greater than 0, we
do the following:
// a mod b = a - b * (a/b) if
// (a) a_max - a_min < b, i.e. that before mod, a's range doesn't cover
[0, b)
// and (b) a_min mod b <= a_max mod b, i.e. that a's range is still
continuous after mod
so that `[13, 15] % 10 = [3, 5]`
IntervalSet evaluation is hard. These improvements are adhoc to the
problems I encountered. I am open to any alternatives.
Question: how about mod? It's implementation looks same as floormod.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]