tmoreau89 commented on a change in pull request #8825:
URL: https://github.com/apache/tvm/pull/8825#discussion_r696206243
##########
File path: tutorials/optimize/opt_gemm.py
##########

```diff
@@ -227,43 +228,49 @@
 ###################################################################################################
 # Array Packing
 # -------------
-# Another important trick is array packing. This trick is to reorder the storage dimension of the
-# array to convert the continuous access pattern on certain dimension to a sequential pattern after
+# Another important trick is array packing. This trick is to reorder the storage dimension of an
+# array to convert the continuous access pattern on its dimensions to a sequential pattern after
 # flattening.
 #
 # .. image:: https://github.com/dmlc/web-data/raw/main/tvm/tutorial/array-packing.png
 #      :align: center
 #
+# NOTE: The figure above is meant for illustration purposes only. Please ignore dimension
+# information ([16][16] and [16/4][16][4]) which does not apply to this tutorial.

 ###################################################################################################
-# Just as it is shown in the figure above, after blocking the computations, we can observe the array
-# access pattern of B (after flattening), which is regular but discontinuous. We expect that after
-# some transformation we can get continuous access pattern. We can reorder a [16][16] array to
-# a [16/4][16][4] array, so that the access pattern of B will be sequential when grabing
-# the corresponding value from the packed array.
-#
+# We can use array packing to address the access pattern for B. Observe the array access pattern of
+# B after flattening which is not sequential as we iterate over the K dimension. We can reorder B
+# with dimensions [K][N] so that it has dimensions [N/bn][K][bn] where bn is the blocking factor and
+# also the vector size for both B in the inner loop. This reorder splits N into two dimensions ---
+# bigN (N/bn) and littleN (bn) --- and the new dimensions [N/bn][K][bn] match the indexing of B
+# from outer to inner loops (no, ko, ki, ni) resulting in a sequential access pattern for B after
+# flattening.
+
 # We have to re-write the algorithm slightly.
-packedB = te.compute((N / bn, K, bn), lambda x, y, z: B[y, x * bn + z], name="packedB")
+packedB = te.compute(
+    (N / bn, K, bn), lambda bigN, k, littleN: B[k, bigN * bn + littleN], name="packedB"
+)
 C = te.compute(
     (M, N),
-    lambda x, y: te.sum(A[x, k] * packedB[y // bn, k, tvm.tir.indexmod(y, bn)], axis=k),
+    lambda m, n: te.sum(A[m, k] * packedB[n // bn, k, tvm.tir.indexmod(n, bn)], axis=k),
     name="C",
 )

 s = te.create_schedule(C.op)

-xo, yo, xi, yi = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)
+mo, no, mi, ni = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)

 (k,) = s[C].op.reduce_axis
-ko, ki = s[C].split(k, factor=4)
+ko, ki = s[C].split(k, factor=kfactor)

-s[C].reorder(xo, yo, ko, xi, ki, yi)
-s[C].vectorize(yi)
+s[C].reorder(mo, no, ko, mi, ki, ni)
+s[C].vectorize(ni)

-x, y, z = s[packedB].op.axis
-s[packedB].vectorize(z)
-s[packedB].parallel(x)
+bigN, k, littleN = s[packedB].op.axis
```

Review comment:

```suggestion
bigN, _, littleN = s[packedB].op.axis
```
I don't think that `k` is used further. Plus we wouldn't want to confuse it with the reduction axis defined in the compute definition, see line 101. WDYT @adstraw
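As a side note for readers of this thread, the [N/bn][K][bn] packing the diff describes can be sketched in plain Python, outside of TVM. The `bigN`/`littleN` names follow the diff; the sizes below are illustrative only, not the tutorial's actual N, K, and bn.

```python
# A plain-Python sketch of the array packing described in the diff above.
# Sizes are illustrative only (hypothetical; the tutorial uses larger N, K, bn).
K, N, bn = 8, 8, 4

# B is a K x N matrix stored row-major: B[k][n] = k * N + n.
B = [[k * N + n for n in range(N)] for k in range(K)]

# Pack B[K][N] into packedB[N/bn][K][bn]:
#   packedB[bigN][k][littleN] = B[k][bigN * bn + littleN]
packedB = [
    [[B[k][bigN * bn + littleN] for littleN in range(bn)] for k in range(K)]
    for bigN in range(N // bn)
]

# The inner-loop indexing that C uses, (n // bn, k, n % bn), recovers B[k][n],
# mirroring packedB[n // bn, k, tvm.tir.indexmod(n, bn)] in the diff.
for k in range(K):
    for n in range(N):
        assert packedB[n // bn][k][n % bn] == B[k][n]

# Within one bigN block, iterating (k, littleN) in loop order is now a
# sequential walk of the packed storage, instead of striding by N through B.
flat_block0 = [v for row in packedB[0] for v in row]
assert flat_block0 == [B[k][littleN] for k in range(K) for littleN in range(bn)]
```

This is why the schedule vectorizes `littleN` (the diff's `ni`/`littleN` axes): after packing, the innermost loop reads contiguous memory.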
