adstraw opened a new issue, #15956:
URL: https://github.com/apache/tvm/issues/15956

   ### This issue is to track progress for Hopper TMA support for bulk 
asynchronous loads
   
   The NVIDIA Hopper GPU architecture adds a Tensor Memory Accelerator (TMA) engine that implements a range of new features to hide memory latency.  The following is an excerpt from the [H100 GPU Architecture Overview](https://resources.nvidia.com/en-us-tensor-core/gtc22-whitepaper-hopper).
   
   > To help feed the powerful new H100 Tensor Cores, data fetch efficiency is 
improved with a new Tensor Memory Accelerator (TMA) that can transfer large 
blocks of data and multi-dimensional tensors from global memory to shared 
memory and vice-versa.
   
   > TMA operations are launched using a copy descriptor which specifies data 
transfers using tensor dimensions and block coordinates instead of per-element 
addressing. Large blocks of data (up to the shared memory capacity) can be 
specified and loaded from global memory into shared memory or stored from 
shared memory back to global memory. TMA significantly reduces addressing 
overhead and improves efficiency with support for different tensor layouts 
(1D-5D tensors), different memory access modes, reductions, and other features.
   
   > The TMA operation is asynchronous and leverages the shared memory-based 
asynchronous barriers introduced in A100. Additionally, the TMA programming 
model is single-threaded, where a single thread in a warp is elected to issue 
an asynchronous TMA operation (cuda::memcpy_async) to copy a tensor, and 
subsequently multiple threads can wait on a cuda::barrier for completion of the 
data transfer. To further improve performance, the H100 SM adds hardware to 
accelerate these asynchronous barrier wait operations.
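   The single-threaded issue model described in the excerpt can be sketched with the libcu++ API the whitepaper names (`cuda::memcpy_async` plus a shared-memory `cuda::barrier`). This is an illustrative sketch of the programming model, not TVM-generated code; the kernel name and layout are hypothetical:

   ```cuda
   // Sketch of the Hopper TMA programming model from the excerpt above:
   // one elected thread issues the async copy, then all threads in the
   // block wait on a shared-memory barrier for completion.
   #include <cuda/barrier>

   __global__ void bulk_load_1d(const float* __restrict__ gmem, int n) {
       extern __shared__ float smem[];
       __shared__ cuda::barrier<cuda::thread_scope_block> bar;

       if (threadIdx.x == 0) {
           init(&bar, blockDim.x);  // all threads in the block participate
       }
       __syncthreads();

       // A single thread is elected to launch the asynchronous copy ...
       if (threadIdx.x == 0) {
           cuda::memcpy_async(smem, gmem, sizeof(float) * n, bar);
       }

       // ... and every thread waits on the barrier for the data transfer.
       bar.arrive_and_wait();

       // smem[0..n) is now valid for use by the whole block.
   }
   ```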
   
   Hopper TMA implements a wide feature space.  To keep this issue tractable, it will focus on a single feature: bulk asynchronous loads (global -> shared) with barrier synchronization for 1D cases using memcpy (destination, source, size) semantics.  Specifically **not** addressed by this issue are:
   
   - Bulk asynchronous stores (shared -> global)
   - Tensor layouts (1D-5D tensors), different memory access modes, reductions, 
and other features
   
   Notably, Hopper bulk asynchronous loads **require** barrier-based synchronization, whereas on Ampere asynchronous loads could be synchronized either with barriers or with group-based methods.
   
   TVM implements the group-based synchronization method for Ampere asynchronous loads, where a `commit_group` instruction marks the end of a group of `cp.async` instructions and a `wait_group` instruction waits for the completion of prior asynchronous copy operations.
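   Reduced to its PTX skeleton, the existing Ampere path looks roughly like the following. This is a hedged sketch of the instruction shapes involved, not literal output of TVM's codegen:

   ```cuda
   // Group-based synchronization as used for Ampere cp.async
   // (illustrative inline-PTX sketch; function name is hypothetical).
   __device__ void ampere_group_sync(float* smem, const float* gmem) {
       // Issue one or more 16-byte asynchronous copies ...
       asm volatile("cp.async.cg.shared.global [%0], [%1], 16;"
                    :: "r"((unsigned)__cvta_generic_to_shared(smem)),
                       "l"(gmem));
       // ... mark the end of the group ...
       asm volatile("cp.async.commit_group;");
       // ... and wait until all prior groups (0 still pending) complete.
       asm volatile("cp.async.wait_group 0;");
       __syncthreads();
   }
   ```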
   
   The intention of this issue is to pivot TVM support for both Ampere and Hopper to barrier synchronization for asynchronous loads.  Here are the development items:
   
   - [x] P1. [Hopper TMA] CUDA codegen for async copy with barrier 
synchronization PR #15616 
   - [x] P2. [Hopper TMA] Add CUDA codegen support for bulk asynchronous copy 
PR #15656 
   - [ ] P3. TIR transform for Hopper asynchronous copy, ideally with a unifying approach for Hopper, Ampere ([inject_ptx_async_copy.cc](https://github.com/apache/tvm/blob/main/src/tir/transforms/inject_ptx_async_copy.cc)), and Hexagon ([lower_async_dma.cc](https://github.com/apache/tvm/blob/main/src/tir/transforms/lower_async_dma.cc))
   - [ ] P4. Modify the InjectSoftwarePipeline pass to use barrier instead of group synchronization on A100/H100
   - [ ] P5. Modify the InjectSoftwarePipeline pass and CUDA codegen to allow bulk load sizes larger than 16 bytes on H100
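   At the PTX level, the barrier-synchronized 1D bulk load these items target might look roughly like the following: an mbarrier in shared memory tracks expected transaction bytes, one thread issues `cp.async.bulk`, and all threads wait on the barrier. This is a hedged sketch based on the sm_90 PTX ISA mnemonics, not TVM output; the function name and exact operand forms are assumptions:

   ```cuda
   // Sketch of a barrier-synchronized Hopper bulk load (global -> shared).
   __device__ void hopper_bulk_load(void* smem_dst, const void* gmem_src,
                                    unsigned bytes, uint64_t* smem_mbar) {
       unsigned dst  = (unsigned)__cvta_generic_to_shared(smem_dst);
       unsigned mbar = (unsigned)__cvta_generic_to_shared(smem_mbar);
       if (threadIdx.x == 0) {
           // One arriving thread; the barrier also counts transaction bytes.
           asm volatile("mbarrier.init.shared.b64 [%0], 1;" :: "r"(mbar));
           asm volatile("mbarrier.arrive.expect_tx.shared.b64 _, [%0], %1;"
                        :: "r"(mbar), "r"(bytes));
           // Single elected thread issues the bulk asynchronous copy.
           asm volatile(
               "cp.async.bulk.shared::cluster.global"
               ".mbarrier::complete_tx::bytes [%0], [%1], %2, [%3];"
               :: "r"(dst), "l"(gmem_src), "r"(bytes), "r"(mbar));
       }
       __syncthreads();
       // All threads spin until the first barrier phase (parity 0) completes.
       unsigned done = 0;
       while (!done) {
           asm volatile("{\n\t.reg .pred p;\n\t"
                        "mbarrier.try_wait.parity.shared.b64 p, [%1], 0;\n\t"
                        "selp.u32 %0, 1, 0, p;\n\t}"
                        : "=r"(done) : "r"(mbar));
       }
   }
   ```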
   
   Confusingly, bulk asynchronous stores (shared -> global), which are also new for Hopper, **require** group-based (not barrier-based) synchronization.  This is mentioned here to avoid over-optimizing for barrier synchronization, since group synchronization may still be required on Hopper.  A first pass at bulk asynchronous store support for Hopper in TVM might be "fire and forget": all stores are issued in a single commit group at the appropriate compute stage (end of operator, pipeline stage), which must complete (wait group) before proceeding to the next compute stage.
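   The "fire and forget" scheme above can be sketched at the PTX level as follows. Again a hedged illustration using sm_90 PTX ISA mnemonics, not TVM codegen output; the function name is hypothetical:

   ```cuda
   // Sketch of fire-and-forget bulk stores (shared -> global): issue the
   // stores, commit them as one group, wait for the group to drain before
   // the next compute stage.
   __device__ void hopper_bulk_store(void* gmem_dst, const void* smem_src,
                                     unsigned bytes) {
       unsigned src = (unsigned)__cvta_generic_to_shared(smem_src);
       if (threadIdx.x == 0) {
           asm volatile("cp.async.bulk.global.shared::cta.bulk_group"
                        " [%0], [%1], %2;"
                        :: "l"(gmem_dst), "r"(src), "r"(bytes));
           // One commit covering all stores in this compute stage ...
           asm volatile("cp.async.bulk.commit_group;");
           // ... then wait for 0 pending groups before proceeding.
           asm volatile("cp.async.bulk.wait_group 0;");
       }
       __syncthreads();
   }
   ```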

