AndrewZhaoLuo opened a new pull request, #14167:
URL: https://github.com/apache/tvm/pull/14167
# CUDA: Improved Layout Transform Schedule
ALL numbers are on an RTX 3070 unless stated otherwise.
## Motivation
The default implementation of layout transform has poor performance in some
scenarios. A motivating factor behind this work was layout transform operators
taking up significant time in some stable diffusion models we tried. NHWC
layout is required for tensor-core use, so we had to convert the layout of
convolution operators. These introduced operators, in both their fused and
unfused versions, were extremely slow. I ended up improving latency by at
least 2x for all of these operators:
*** TABLE HERE ***
## Algorithm
Currently, layout transform relies on the AutoBind schedule rule, which seems
to guarantee contiguous writes to the output buffer when assigning thread and
block indices to loops. However, in the case of layout transforms where the
inner dimensions of `src_layout` and `dst_layout` do not match, it is
impossible to have contiguous writes and contiguous reads in the same
operation.
We ignore the case where the inner dimensions match (though I believe the
new schedule might be faster in some of those scenarios too).
### Simple Case: Transpose
The way around this is to use a small amount of shared memory and to tile
loads so that reads from global memory -> shared memory can be coalesced. We
carefully load elements so that writes from shared -> global memory also have
coalesced access. An example is taking the transpose of a [2048, 1024] matrix
-> [1024, 2048]. We can read 32 x 32 tiles of the src matrix, making rows of
our shared memory tile correspond to coalesced accesses of the src matrix, and
columns of our shared memory tile correspond to coalesced accesses of the dst
matrix. By doing this, we maximize the memory throughput of our operations.
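As a sanity check, the read/write pattern above can be simulated in plain Python (a hypothetical sketch, not the actual TVM schedule): rows of the shared tile are filled by contiguous src reads, and columns of the tile are drained by contiguous dst writes.

```python
# Hypothetical sketch: simulate the 32x32 shared-memory tile used to
# transpose a row-major [rows, cols] matrix with coalesced reads AND writes.
TILE = 32

def tiled_transpose(src, rows, cols):
    """src is a flat row-major list of length rows * cols; returns the
    flat row-major [cols, rows] transpose."""
    dst = [0] * (rows * cols)
    for tr in range(0, rows, TILE):          # tile row origin in src
        for tc in range(0, cols, TILE):      # tile col origin in src
            # Read phase: consecutive lanes read consecutive src columns,
            # i.e. contiguous (coalesced) addresses, into rows of the tile.
            tile = [[src[(tr + i) * cols + (tc + j)] for j in range(TILE)]
                    for i in range(TILE)]
            # Write phase: consecutive lanes now walk a *column* of the
            # tile, which lands on consecutive dst addresses.
            for j in range(TILE):
                for i in range(TILE):
                    dst[(tc + j) * rows + (tr + i)] = tile[i][j]
    return dst
```

The two phases touch the shared tile in different orders precisely so that both global-memory phases stay contiguous.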
### General Case
While a simple transpose is easy to reason about, how do we guarantee this
behavior for general layout transforms where dimensions can be arbitrarily
permuted?
The answer is that we can make use of the analysis tools in TVM scheduling!
Specifically, if we read from global memory and write into our shared memory
correctly, setting up the loop structure so that the outer loops will
eventually be bound to blockIdx's, we can set up our read loops and then use
`compute_at` or `reverse_compute_at` to automatically generate the proper
extents for the writing loops!
We therefore only have to care about getting the loop structure of the read
blocks right so that the eventual write block can also have coalesced memory
access. We have a few constraints to think about here:
- Reads from global memory can be coalesced as much as possible, up to tile_size
- Writes to global memory can be coalesced as much as possible, up to tile_size
- Each thread will read up to tile_size elements from global memory
- Each thread will write up to tile_size elements into global memory

We can guarantee the proper number of reads and writes by tiling our loops
like so:
```
for block_idx, thread_idx, inner_loop in T.grid(remainder, tile_size, tile_size):
    # each combination of indices reads one element
```
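Concretely, since each (block_idx, thread_idx, inner_loop) combination touches exactly one element, the block extent is just the total element count divided by tile_size squared. A small sketch (helper name is hypothetical):

```python
# Hypothetical sketch: the tiled loop nest touches each element exactly
# once, so the outer (block) extent is total_elements / tile_size**2.
def loop_extents(shape, tile_size):
    total = 1
    for d in shape:
        total *= d
    # Shapes that don't divide evenly must be padded first (see below).
    assert total % (tile_size * tile_size) == 0, "shape must be padded first"
    remainder = total // (tile_size * tile_size)
    return remainder, tile_size, tile_size

# e.g. a [2048, 1024] transpose with tile_size = 32:
print(loop_extents([2048, 1024], 32))  # -> (2048, 32, 32)
```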
Guaranteeing coalesced reads is also simple. Essentially, the innermost
loops must access the inner dimensions of src_layout. For example, let's say
we have a tensor of shape
[2_d, 128, 64, 32, 8, 4, 2, 2_s] (_s and _d refer to the innermost
dimensions of the src and dst layouts respectively) and we want to do a layout
transform from "ABCDEFGH" -> "HBCDFEGA" (A and H swapped, E and F swapped).
Then for reading we want our inner loop of extent tile_size to correspond to
the innermost dimensions of the source layout. E.g. with a tile_size of 32, we
divide the dimension of 8 into 4 x 2 and combine the loops of extents 2, 4, 2,
2_s into one loop of 32.
Guaranteeing coalesced writes is done similarly, except we must consider
things from the point of view of the dst_layout:
[2_s, 128, 64, 32, 4, 8, 2, 2_d] (first and last dimensions swapped, as are
the 4 and the 8).
We can use similar logic, combining the dimensions [8, 2, 2_d] to get our
tile_size. However, notice that we already utilized some of these loops when
guaranteeing coalesced reads! We cannot reuse the same factors, or else the
loop nest will read and write different numbers of elements. So instead we use
the factors remaining in each dimension until we also hit our tile_size.
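The factor bookkeeping described above can be sketched as a greedy pass over the dimensions, innermost first, consuming factors until the product reaches tile_size; the write-side pass then only sees whatever factor of each dimension the read-side pass left behind. This is a simplified illustration with hypothetical helper names, not the actual sampling logic in the PR:

```python
from math import gcd

def pick_factors(dims_inner_first, remaining, tile_size):
    """Greedily consume factors of each dimension (innermost first)
    until their product reaches tile_size."""
    need = tile_size
    taken = []
    for name in dims_inner_first:
        if need == 1:
            break
        f = gcd(remaining[name], need)
        if f > 1:
            remaining[name] //= f   # consume this factor of the dimension
            need //= f
            taken.append((name, f))
    assert need == 1, "extents do not factor into tile_size"
    return taken

# The example from the text: src layout "ABCDEFGH" -> dst "HBCDFEGA".
extents = {"A": 2, "B": 128, "C": 64, "D": 32,
           "E": 8, "F": 4, "G": 2, "H": 2}
remaining = dict(extents)
# Read side walks the src dims innermost first; the write side then walks
# the dst dims innermost first, seeing only the leftover factors.
read_factors = pick_factors(list("HGFEDCBA"), remaining, 32)
write_factors = pick_factors(list("AGEFDCBH"), remaining, 32)
print(read_factors)   # [('H', 2), ('G', 2), ('F', 4), ('E', 2)]
print(write_factors)  # E contributes its leftover 4; the rest comes from D
```

Note how the read side splits E = 8 into 4 x 2 (taking the 2), exactly as in the prose, and the write side can then only use the remaining factor of 4 from E.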
We now have our read and write dimensions tiled to satisfy those constraints
as closely as possible. To handle shapes that don't divide nicely into
tile_size, we pad some dimensions until they divide into tile_size.
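The padding step is just rounding each tiled extent up to the next multiple of tile_size — a trivial sketch (helper name is hypothetical):

```python
# Hypothetical sketch: round an extent up to the next multiple of a factor
# so the padded shape divides evenly into the tile.
def pad_to_multiple(extent, factor):
    return -(-extent // factor) * factor  # ceiling division, then scale

# e.g. with tile_size = 32, an extent of 1209 pads to 1216 and 9 pads to 32:
print(pad_to_multiple(1209, 32), pad_to_multiple(9, 32))  # -> 1216 32
```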
Incidentally, the above example with a dtype of "float16" and a tile_size of
32 has a runtime of 667 us vs. the default 993 us, with a 90% memory
throughput reported by the NVIDIA compute profiling tool!
## Choice of search space
- The most natural tile sizes are those aligned with the coalesced
transaction limits in global memory of 32, 64, and 128 bytes. A tile size of
128 (i.e. a block of 128 threads) is too high (each thread would also be
responsible for 128 elements of work), though it would hit the transaction
limit with 1-byte datatypes. 64 is therefore a natural upper limit on tile
size. Due to the nature of factoring, other tile sizes between 1 and 64 might
be better for various shapes. It's a relatively small search space, so I have
elected to try all tile sizes from 1 to 64 inclusive, plus the default
AutoBind implementation. #search
## Known defects:
- High memory use from common factors:
Analysis when using `compute_at` seems to fail (or I am missing something)
if we sample factors from the same dimensions for both the dim0 and dim1
tilings. This *sometimes* leads to excessive shared memory use, which can
cause failures or performance issues. A lot of the time these schedules are
still faster than the default ones, but occasionally they use much more memory
than I expect. These are a very tiny fraction of tested cases, however, and we
always try both the new schedule and AutoBind, so it should be okay for now.
I am not sure why this happens; it needs more investigation. An example is
transposing [1209, 9] of any type with a tile size of 32.
- Shared memory bank conflicts:
Shared memory bank conflicts exist and are common for the strategy used.
Consider a simple transpose of a [2048, 1024] matrix to [1024, 2048]. With a
tile size of 32, the shared memory buffer will be of shape [32, 32]. We might
read from global memory into rows of shared memory, and then write columns of
shared memory to global memory for contiguous access. However, note that the
columns of this shared memory buffer lie on the same memory bank!
A common solution is to simply pad the innermost dimension of the shared
memory buffer, e.g. [32, 33], which makes accesses along columns
bank-conflict free.
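The effect of the padding can be illustrated with a quick bank-index calculation (a sketch assuming 32 banks and one element per bank, i.e. 4-byte elements):

```python
# Hypothetical sketch: an element at flat shared-memory index i lives in
# bank i % 32 (assuming 4-byte elements, one element per bank). Compare a
# column walk through a [32, 32] tile with the same walk through a padded
# [32, 33] tile.
NUM_BANKS = 32

def banks_of_column(rows, row_stride, col):
    return [(r * row_stride + col) % NUM_BANKS for r in range(rows)]

unpadded = banks_of_column(32, 32, 0)  # every access hits the same bank
padded = banks_of_column(32, 33, 0)    # accesses spread over all 32 banks
print(len(set(unpadded)), len(set(padded)))  # -> 1 32
```

With stride 32, every row of a column maps to the same bank (a 32-way conflict); with stride 33, consecutive rows land on consecutive banks.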
This is planned to be done in a future PR via a new scheduling rule, and it
is a general problem throughout all CUDA-generated schedules. To give an idea
of the impact, a [1024, 2048] transpose went from 14.34 us -> 12.35 us after
this change, basing off the optimized layout transform described in this PR.
- Non-aligned inner dimensions:
This is an issue I did not think of when writing this schedule. The schedule
is written from the viewpoint of maximizing coalesced memory access in global
memory. However, one small detail is that coalesced memory access must be
aligned to the size of the transaction. That is, if we have a coalesced access
of 64 bytes (e.g. 32 float16s), then each address accessed must be on the same
64-byte line (e.g. only the last 6 bits of the address may differ).
Consider a layout transform where some dimensions are prime numbers, e.g.
[3, 1024, 1024, 7] -> [7, 1024, 1024, 3]. Then the current strategy will read
7-element-wide chunks at a time. However, most accesses will straddle
coalesced memory boundaries, resulting in two coalesced memory requests
instead of one. E.g. let's say coalesced memory must be 8-byte aligned and we
are dealing with a one-byte datatype. The first read of 7 elements covers
0x00, 0x01 … 0x06 and the next covers 0x07, 0x08 … 0x0D. For the second
access, 0x07 belongs to the first 8-byte line while 0x08 … 0x0D belong to the
second 8-byte line, requiring two memory transactions.
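The transaction count for these misaligned 7-element reads can be checked with a few lines of arithmetic (a sketch of the example above, with an 8-byte line size):

```python
# Hypothetical sketch: count how many aligned memory lines a contiguous
# access of `width` one-byte elements starting at `addr` touches, for an
# 8-byte line size as in the example above.
LINE = 8

def lines_touched(addr, width):
    first = addr // LINE
    last = (addr + width - 1) // LINE
    return last - first + 1

# 7-byte reads at stride 7: only the first chunk is fully aligned, so
# nearly every access costs two transactions instead of one.
print([lines_touched(7 * k, 7) for k in range(4)])  # -> [1, 2, 2, 2]
```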
One possible way around this is to treat the array as flattened and simply
do coalesced accesses over the flat buffer, though I am not sure about the
details; guaranteeing good access for both src and dst will require some
thinking, though it might be possible.
An interesting observation in this case is that if we do the no-op reshape
into [3, 1024, 32, 32, 7] and then into [3, 1024, 32, 32 * 7], then
[3, 1024, 32, 7, 32], things become obvious. However, trying something like
this initially leads to weird calculated bounds in the compute_at step and
excessive shared memory usage, as we must also consider the dst_layout.
## Results:
Good stuff