AndrewZhaoLuo opened a new pull request, #14167:
URL: https://github.com/apache/tvm/pull/14167
# CUDA: Improved Layout Transform Schedule
ALL numbers are on an RTX 3070 unless stated otherwise.
## Motivation
The default implementation of layout transform has poor performance in some
scenarios. A motivating factor behind this work was layout transform operators
taking up significant time in some stable diffusion models we tried. NHWC
layout is required for tensor-core use, so we had to convert the layout of
convolution operators. These introduced operators, in both their fused and
unfused versions, were extremely slow. I ended up improving latency by at
least 2x for all of these operators:
*** TABLE HERE ***
## Algorithm
Currently, layout transform relies on the AutoBind schedule rule, which seems
to guarantee contiguous writes to the output buffer when assigning thread and
block indices to loops. However, in the case of layout transforms where the
inner dimensions of `src_layout` and `dst_layout` do not match, it is
impossible to have contiguous writes and contiguous reads in the same
operation.
We ignore the case where the inner dimensions match (though I believe the
new schedule might be faster in some of those scenarios too).
### Simple Case: Transpose
The way around this is to use a small amount of shared memory and to tile
loads so that reads from global memory -> shared memory can be coalesced. We
carefully load elements so that writes from shared -> global memory also have
coalesced access. An example is taking the transpose of a [2048, 1024] matrix
-> [1024, 2048]. We can read 32 x 32 tiles of the src matrix, making rows of
our shared memory tile correspond to coalesced accesses of the src matrix, and
columns of our shared memory tile correspond to coalesced accesses of the dst
matrix. By doing this, we maximize the memory throughput of our operations.
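As a sanity check, the read/write pattern above can be simulated in plain Python (a hypothetical sketch, not the actual TVM schedule): rows of the shared tile are filled by contiguous src reads, and columns of the tile are drained by contiguous dst writes.

```python
# Hypothetical sketch: simulate the 32x32 shared-memory tile used to
# transpose a row-major [rows, cols] matrix with coalesced reads AND writes.
TILE = 32

def tiled_transpose(src, rows, cols):
    """src is a flat row-major list of length rows * cols; returns the
    flat row-major [cols, rows] transpose."""
    dst = [0] * (rows * cols)
    for tr in range(0, rows, TILE):          # tile row origin in src
        for tc in range(0, cols, TILE):      # tile col origin in src
            # Read phase: consecutive lanes read consecutive src columns,
            # i.e. contiguous (coalesced) addresses, into rows of the tile.
            tile = [[src[(tr + i) * cols + (tc + j)] for j in range(TILE)]
                    for i in range(TILE)]
            # Write phase: consecutive lanes now walk a *column* of the
            # tile, which lands on consecutive dst addresses.
            for j in range(TILE):
                for i in range(TILE):
                    dst[(tc + j) * rows + (tr + i)] = tile[i][j]
    return dst
```

The two phases touch the shared tile in different orders precisely so that both global-memory phases stay contiguous.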
### General Case
While a simple transpose is easy to reason about, how do we guarantee this
behavior for general layout transforms where dimensions can be arbitrarily
permuted?
The answer is that we can make use of the analysis tools in TVM scheduling!
Specifically, if we read from global memory and write into our shared memory
correctly, setting up the loop structure so that the outer loops will
eventually be bound to blockIdx's, we can set up our read loops and then use
`compute_at` or `reverse_compute_at` to automatically generate the proper
extents for the writing loops!
We therefore only have to care about getting the loop structure of the read
blocks right so that the eventual write block can also have coalesced memory
access. We have a few constraints to think about here:
- Reads from global memory can be coalesced as much as possible, up to tile_size
- Writes to global memory can be coalesced as much as possible, up to tile_size
- Each thread will read up to tile_size elements from global memory
- Each thread will write up to tile_size elements into global memory

We can guarantee the proper number of reads and writes by tiling our loops
like so:
```
for block_idx, thread_idx, inner_loop in T.grid(remainder, tile_size, tile_size):
    # each combination of indices reads one element
```
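Concretely, since each (block_idx, thread_idx, inner_loop) combination touches exactly one element, the block extent is just the total element count divided by tile_size squared. A small sketch (helper name is hypothetical):

```python
# Hypothetical sketch: the tiled loop nest touches each element exactly
# once, so the outer (block) extent is total_elements / tile_size**2.
def loop_extents(shape, tile_size):
    total = 1
    for d in shape:
        total *= d
    # Shapes that don't divide evenly must be padded first (see below).
    assert total % (tile_size * tile_size) == 0, "shape must be padded first"
    remainder = total // (tile_size * tile_size)
    return remainder, tile_size, tile_size

# e.g. a [2048, 1024] transpose with tile_size = 32:
print(loop_extents([2048, 1024], 32))  # -> (2048, 32, 32)
```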
Guaranteeing coalesced reads is also simple. Essentially, the innermost
loops must access the inner dimensions of src_layout. For example, let's say
we have a tensor of shape
[2_d, 128, 64, 32, 8, 4, 2, 2_s] (_s and _d refer to the innermost
dimensions of the src and dst layouts respectively) and we want to do a layout
transform from "ABCDEFGH" -> "HBCDFEGA" (A and H swapped, E and F swapped).
Then for reading we want our inner loop of extent tile_size to correspond to
the innermost dimensions of the source layout. E.g. with a tile_size of 32, we
divide the dimension of 8 into 4 x 2 and combine the loops of extents 2, 4, 2,
2_s into one loop of 32.
Guaranteeing coalesced writes is done similarly, except we must consider
things from the point of view of the dst_layout:
[2_s, 128, 64, 32, 4, 8, 2, 2_d] (first and last dimensions swapped, as are
the 4 and the 8).
We can use similar logic, combining the dimensions [8, 2, 2_d] to get our
tile_size. However, notice that we already utilized some of these loops when
guaranteeing coalesced reads! We cannot reuse the same factors, or else the
loop nest will read and write different numbers of elements. So instead we use
the factors remaining in each dimension until we also hit our tile_size.
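The factor bookkeeping described above can be sketched as a greedy pass over the dimensions, innermost first, consuming factors until the product reaches tile_size; the write-side pass then only sees whatever factor of each dimension the read-side pass left behind. This is a simplified illustration with hypothetical helper names, not the actual sampling logic in the PR:

```python
from math import gcd

def pick_factors(dims_inner_first, remaining, tile_size):
    """Greedily consume factors of each dimension (innermost first)
    until their product reaches tile_size."""
    need = tile_size
    taken = []
    for name in dims_inner_first:
        if need == 1:
            break
        f = gcd(remaining[name], need)
        if f > 1:
            remaining[name] //= f   # consume this factor of the dimension
            need //= f
            taken.append((name, f))
    assert need == 1, "extents do not factor into tile_size"
    return taken

# The example from the text: src layout "ABCDEFGH" -> dst "HBCDFEGA".
extents = {"A": 2, "B": 128, "C": 64, "D": 32,
           "E": 8, "F": 4, "G": 2, "H": 2}
remaining = dict(extents)
# Read side walks the src dims innermost first; the write side then walks
# the dst dims innermost first, seeing only the leftover factors.
read_factors = pick_factors(list("HGFEDCBA"), remaining, 32)
write_factors = pick_factors(list("AGEFDCBH"), remaining, 32)
print(read_factors)   # [('H', 2), ('G', 2), ('F', 4), ('E', 2)]
print(write_factors)  # E contributes its leftover 4; the rest comes from D
```

Note how the read side splits E = 8 into 4 x 2 (taking the 2), exactly as in the prose, and the write side can then only use the remaining factor of 4 from E.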
We now have our read and write dimensions tiled to satisfy those constraints
as closely as possible. To handle shapes that don't divide nicely into
tile_size, we pad some dimensions until they divide into tile_size.
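The padding step is just rounding each tiled extent up to the next multiple of tile_size — a trivial sketch (helper name is hypothetical):

```python
# Hypothetical sketch: round an extent up to the next multiple of a factor
# so the padded shape divides evenly into the tile.
def pad_to_multiple(extent, factor):
    return -(-extent // factor) * factor  # ceiling division, then scale

# e.g. with tile_size = 32, an extent of 1209 pads to 1216 and 9 pads to 32:
print(pad_to_multiple(1209, 32), pad_to_multiple(9, 32))  # -> 1216 32
```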
Incidentally, the above example with a dtype of "float16" and a tile_size of
32 has a runtime of 667 us vs. the default 993 us, with a 90% memory
throughput reported by the NVIDIA compute profiling tool!
## Choice of search space
- The most natural tile sizes are those aligned with the coalesced
transaction limits in global memory of 32, 64, and 128 bytes. A tile size of
128 (i.e. a block of 128 threads) is too high (each thread would also be
responsible for 128 elements of work), though it would hit the transaction
limit with 1-byte datatypes. 64 is therefore a natural upper limit on tile
size. Due to the nature of factoring, other tile sizes between 1 and 64 might
be better for various shapes. It's a relatively small search space, so I have
elected to try all tile sizes from 1 to 64 inclusive, plus the default
AutoBind implementation. #search
## Known defects:
- High memory use from common factors:
Analysis when using `compute_at` seems to fail (or I am missing something)
if we sample factors from the same dimensions for both the dim0 and dim1
tilings. This *sometimes* leads to excessive shared memory use, which can
cause failures or performance issues. A lot of the time these schedules are
still faster than the default ones, but occasionally they use much more memory
than I expect. These are a very tiny fraction of tested cases, however, and we
always try both the new schedule and AutoBind, so it should be okay for now.
I am not sure why this happens; it needs more investigation. An example is
transposing [1209, 9] of any type with a tile size of 32.
- Shared memory bank conflicts:
Shared memory bank conflicts exist and are common for the strategy used.
Consider a simple transpose of a [2048, 1024] matrix to [1024, 2048]. With a
tile size of 32, the shared memory buffer will be of shape [32, 32]. We might
read from global memory into rows of shared memory, and then write columns of
shared memory to global memory for contiguous access. However, note that the
columns of this shared memory buffer lie on the same memory bank!
A common solution is to simply pad the innermost dimension of the shared
memory buffer, e.g. [32, 33], which makes accesses along columns
bank-conflict free.
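The effect of the padding can be illustrated with a quick bank-index calculation (a sketch assuming 32 banks and one element per bank, i.e. 4-byte elements):

```python
# Hypothetical sketch: an element at flat shared-memory index i lives in
# bank i % 32 (assuming 4-byte elements, one element per bank). Compare a
# column walk through a [32, 32] tile with the same walk through a padded
# [32, 33] tile.
NUM_BANKS = 32

def banks_of_column(rows, row_stride, col):
    return [(r * row_stride + col) % NUM_BANKS for r in range(rows)]

unpadded = banks_of_column(32, 32, 0)  # every access hits the same bank
padded = banks_of_column(32, 33, 0)    # accesses spread over all 32 banks
print(len(set(unpadded)), len(set(padded)))  # -> 1 32
```

With stride 32, every row of a column maps to the same bank (a 32-way conflict); with stride 33, consecutive rows land on consecutive banks.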
This is planned to be done in a future PR via a new scheduling rule, and it
is a general problem throughout all CUDA-generated schedules. To give an idea
of the impact, a [1024, 2048] transpose went from 14.34 us -> 12.35 us after
this change, basing off the optimized layout transform described in this PR.
- Non-aligned inner dimensions:
This is an issue I did not think of when writing this schedule. The schedule
is written from the viewpoint of maximizing coalesced memory access in global
memory. However, one small detail is that coalesced memory access must be
aligned to the size of the transaction. That is, if we have a coalesced access
of 64 bytes (e.g. 32 float16s), then each address accessed must be on the same
64-byte line (e.g. only the last 6 bits of the address may differ).
Consider a layout transform where some dimensions are prime numbers, e.g.
[3, 1024, 1024, 7] -> [7, 1024, 1024, 3]. Then the current strategy will read
7-element-wide chunks at a time. However, most accesses will straddle
coalesced memory boundaries, resulting in two coalesced memory requests
instead of one. E.g. let's say coalesced memory must be 8-byte aligned and we
are dealing with a one-byte datatype. The first read of 7 elements covers
0x00, 0x01 … 0x06 and the next covers 0x07, 0x08 … 0x0D. For the second
access, 0x07 belongs to the first 8-byte line while 0x08 … 0x0D belong to the
second 8-byte line, requiring two memory transactions.
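The transaction count for these misaligned 7-element reads can be checked with a few lines of arithmetic (a sketch of the example above, with an 8-byte line size):

```python
# Hypothetical sketch: count how many aligned memory lines a contiguous
# access of `width` one-byte elements starting at `addr` touches, for an
# 8-byte line size as in the example above.
LINE = 8

def lines_touched(addr, width):
    first = addr // LINE
    last = (addr + width - 1) // LINE
    return last - first + 1

# 7-byte reads at stride 7: only the first chunk is fully aligned, so
# nearly every access costs two transactions instead of one.
print([lines_touched(7 * k, 7) for k in range(4)])  # -> [1, 2, 2, 2]
```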
One possible way around this is to treat the array as flattened and simply
do coalesced accesses over the flat buffer, though I am not sure about the
details; guaranteeing good access for both src and dst will require some
thinking, though it might be possible.
An interesting observation in this case is that if we do the no-op reshape
into [3, 1024, 32, 32, 7] and then into [3, 1024, 32, 32 * 7], then
[3, 1024, 32, 7, 32], things become obvious. However, trying something like
this initially leads to weird calculated bounds in the compute_at step and
excessive shared memory usage, as we must also consider the dst_layout.
## Results:
Good stuff