adstraw opened a new issue, #15956:
URL: https://github.com/apache/tvm/issues/15956

   ### This issue is to track progress for Hopper TMA support for bulk 
asynchronous loads
   
   The NVIDIA Hopper GPU architecture adds a Tensor Memory Accelerator (TMA) engine that implements a range of new features to hide memory latency.  The following is an excerpt from the [H100 GPU Architecture Overview](https://resources.nvidia.com/en-us-tensor-core/gtc22-whitepaper-hopper).
   
   > To help feed the powerful new H100 Tensor Cores, data fetch efficiency is 
improved with a new Tensor Memory Accelerator (TMA) that can transfer large 
blocks of data and multi-dimensional tensors from global memory to shared 
memory and vice-versa.
   
   > TMA operations are launched using a copy descriptor which specifies data 
transfers using tensor dimensions and block coordinates instead of per-element 
addressing. Large blocks of data (up to the shared memory capacity) can be 
specified and loaded from global memory into shared memory or stored from 
shared memory back to global memory. TMA significantly reduces addressing 
overhead and improves efficiency with support for different tensor layouts 
(1D-5D tensors), different memory access modes, reductions, and other features.
   
   > The TMA operation is asynchronous and leverages the shared memory-based 
asynchronous barriers introduced in A100. Additionally, the TMA programming 
model is single-threaded, where a single thread in a warp is elected to issue 
an asynchronous TMA operation (cuda::memcpy_async) to copy a tensor, and 
subsequently multiple threads can wait on a cuda::barrier for completion of the 
data transfer. To further improve performance, the H100 SM adds hardware to 
accelerate these asynchronous barrier wait operations.
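   The single-threaded issue model described in the excerpt can be sketched with the libcu++ API the whitepaper names (`cuda::memcpy_async` plus a shared-memory `cuda::barrier`). This is an illustrative sketch of the programming model, not TVM-generated code; the kernel name and layout are hypothetical:

   ```cuda
   // Sketch of the Hopper TMA programming model from the excerpt above:
   // one elected thread issues the async copy, then all threads in the
   // block wait on a shared-memory barrier for completion.
   #include <cuda/barrier>

   __global__ void bulk_load_1d(const float* __restrict__ gmem, int n) {
       extern __shared__ float smem[];
       __shared__ cuda::barrier<cuda::thread_scope_block> bar;

       if (threadIdx.x == 0) {
           init(&bar, blockDim.x);  // all threads in the block participate
       }
       __syncthreads();

       // A single thread is elected to launch the asynchronous copy ...
       if (threadIdx.x == 0) {
           cuda::memcpy_async(smem, gmem, sizeof(float) * n, bar);
       }

       // ... and every thread waits on the barrier for the data transfer.
       bar.arrive_and_wait();

       // smem[0..n) is now valid for use by the whole block.
   }
   ```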
   
   Hopper TMA implements a wide feature space.  To keep this issue tractable, it will focus on a single feature: bulk asynchronous loads (global -> shared) with barrier synchronization for 1D cases using memcpy (destination, source, size) semantics.  Specifically **not** addressed by this issue are:
   
   - Bulk asynchronous stores (shared -> global)
   - Tensor layouts (1D-5D tensors), different memory access modes, reductions, 
and other features
   
   Notably, Hopper bulk asynchronous loads **require** barrier-based synchronization, whereas on Ampere asynchronous loads could be synchronized either with barriers or with group-based methods.
   
   TVM implements the group-based synchronization method for Ampere asynchronous loads, where a `commit_group` instruction marks the end of a group of `cp.async` instructions and a `wait_group` instruction waits for the completion of prior asynchronous copy operations.
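   Reduced to its PTX skeleton, the existing Ampere path looks roughly like the following. This is a hedged sketch of the instruction shapes involved, not literal output of TVM's codegen:

   ```cuda
   // Group-based synchronization as used for Ampere cp.async
   // (illustrative inline-PTX sketch; function name is hypothetical).
   __device__ void ampere_group_sync(float* smem, const float* gmem) {
       // Issue one or more 16-byte asynchronous copies ...
       asm volatile("cp.async.cg.shared.global [%0], [%1], 16;"
                    :: "r"((unsigned)__cvta_generic_to_shared(smem)),
                       "l"(gmem));
       // ... mark the end of the group ...
       asm volatile("cp.async.commit_group;");
       // ... and wait until all prior groups (0 still pending) complete.
       asm volatile("cp.async.wait_group 0;");
       __syncthreads();
   }
   ```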
   
   The intention of this issue is to pivot TVM support for both Ampere and Hopper to barrier synchronization for asynchronous loads.  Here are the development items:
   
   - [x] P1. [Hopper TMA] CUDA codegen for async copy with barrier 
synchronization PR #15616 
   - [x] P2. [Hopper TMA] Add CUDA codegen support for bulk asynchronous copy 
PR #15656 
   - [ ] P3. TIR transform for Hopper asynchronous copy, ideally with a unifying approach for Hopper, Ampere ([inject_ptx_async_copy.cc](https://github.com/apache/tvm/blob/main/src/tir/transforms/inject_ptx_async_copy.cc)), and Hexagon ([lower_async_dma.cc](https://github.com/apache/tvm/blob/main/src/tir/transforms/lower_async_dma.cc))
   - [ ] P4. Modify the InjectSoftwarePipeline pass to use barrier instead of group synchronization on A100/H100
   - [ ] P5. Modify the InjectSoftwarePipeline pass and CUDA codegen to allow bulk load sizes larger than 16 bytes on H100
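   At the PTX level, the barrier-synchronized 1D bulk load these items target might look roughly like the following: an mbarrier in shared memory tracks expected transaction bytes, one thread issues `cp.async.bulk`, and all threads wait on the barrier. This is a hedged sketch based on the sm_90 PTX ISA mnemonics, not TVM output; the function name and exact operand forms are assumptions:

   ```cuda
   // Sketch of a barrier-synchronized Hopper bulk load (global -> shared).
   __device__ void hopper_bulk_load(void* smem_dst, const void* gmem_src,
                                    unsigned bytes, uint64_t* smem_mbar) {
       unsigned dst  = (unsigned)__cvta_generic_to_shared(smem_dst);
       unsigned mbar = (unsigned)__cvta_generic_to_shared(smem_mbar);
       if (threadIdx.x == 0) {
           // One arriving thread; the barrier also counts transaction bytes.
           asm volatile("mbarrier.init.shared.b64 [%0], 1;" :: "r"(mbar));
           asm volatile("mbarrier.arrive.expect_tx.shared.b64 _, [%0], %1;"
                        :: "r"(mbar), "r"(bytes));
           // Single elected thread issues the bulk asynchronous copy.
           asm volatile(
               "cp.async.bulk.shared::cluster.global"
               ".mbarrier::complete_tx::bytes [%0], [%1], %2, [%3];"
               :: "r"(dst), "l"(gmem_src), "r"(bytes), "r"(mbar));
       }
       __syncthreads();
       // All threads spin until the first barrier phase (parity 0) completes.
       unsigned done = 0;
       while (!done) {
           asm volatile("{\n\t.reg .pred p;\n\t"
                        "mbarrier.try_wait.parity.shared.b64 p, [%1], 0;\n\t"
                        "selp.u32 %0, 1, 0, p;\n\t}"
                        : "=r"(done) : "r"(mbar));
       }
   }
   ```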
   
   Confusingly, bulk asynchronous stores (shared -> global), which are also new for Hopper, **require** group-based (not barrier-based) synchronization.  This is mentioned here to avoid over-optimizing for barrier synchronization, since group synchronization may still be required on Hopper.  A first pass at bulk asynchronous store support for Hopper in TVM might be "fire and forget": all stores are issued in a single commit group at the appropriate compute stage (end of operator, pipeline stage), which must complete (wait group) before proceeding to the next compute stage.
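   The "fire and forget" scheme above can be sketched at the PTX level as follows. Again a hedged illustration using sm_90 PTX ISA mnemonics, not TVM codegen output; the function name is hypothetical:

   ```cuda
   // Sketch of fire-and-forget bulk stores (shared -> global): issue the
   // stores, commit them as one group, wait for the group to drain before
   // the next compute stage.
   __device__ void hopper_bulk_store(void* gmem_dst, const void* smem_src,
                                     unsigned bytes) {
       unsigned src = (unsigned)__cvta_generic_to_shared(smem_src);
       if (threadIdx.x == 0) {
           asm volatile("cp.async.bulk.global.shared::cta.bulk_group"
                        " [%0], [%1], %2;"
                        :: "l"(gmem_dst), "r"(src), "r"(bytes));
           // One commit covering all stores in this compute stage ...
           asm volatile("cp.async.bulk.commit_group;");
           // ... then wait for 0 pending groups before proceeding.
           asm volatile("cp.async.bulk.wait_group 0;");
       }
       __syncthreads();
   }
   ```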

