csullivan opened a new pull request #7686:
URL: https://github.com/apache/tvm/pull/7686


   This PR introduces 2d texture memory support for TE, TIR, OpenCL codegen, 
and the OpenCL Device API. 
   
   The lowering process is driven principally by the TextureFlatten TIR 
lowering pass which flattens multi-dimensional loads and stores to two 
dimensional accesses. The texture lowering is agnostic to the utilization of 
the texture memory, e.g. via `cache_read("texture")` as a temporary scratch 
pad, or from a runtime tir::Buffer Var of TextureType.
   
   **Layout**
   - The current lowering path assumes two dimensional textures will always 
have the lowest dimension of size 4 (RGBA); in this way, the use of packed 
layouts for activations and weights are required. In [Part 2] schedules are 
introduced using NCHW4c and OIHW4o to take advantage of the texture lowering 
path. 
   - Separate Nd to 2d lowering conventions are employed for activation and for 
weights. For activations all dimensions save the last (excluding the vector 
length 4 dimension) are packed into columns of the texture image; for weights 
all dimensions save the first are packed into rows of the texture image: 
   ```
   scope == "texture" :: [A, B, C, D, 4] -> [A*B*C, D, 4]
   scope == "texture:weight" :: [A, B, C, D] -> [A, B*C*D, 4]
   ```
   Additionally, if any other layout lowering is desired, one can apply logical 
shape changes via a te.compute in the topi compute definition to ensure the 
flattening occurs as desired.
   
   - These two lowering conventions are used in TIR lowering, the opencl device 
runtime, and in the graph runtime, and are implemented in 
"src/runtime/texture.h". 
   - 
   
   **Codegen**
   - OpenCL codegen of `write_image` and `read_image` on image2d_t's of type 
CL_FLOAT and CL_HALF types and RGBA channel order are supported. 
   - A pass is added in code generation to infer `__read_only` and 
`__write_only` access qualifiers for the utilized texture buffers, resulting in 
kernel signatures like, 
   
   ```
   __kernel void fused_nn_kernel0(__read_only image2d_t placeholder0,  ..., 
__write_only image2d_t compute,) 
   ```
   
   - To avoid extra texture reads, SSA intermediates are utilized when an 
explicit cache_read to local memory is not specified. 
   - Single element index into an RGBA texture read is supported to enable 
outer product style broadcasting.
   
   **Device runtime**
   - The device runtime supports allocating texture memory both as a temporal 
workspace and as a runtime data space. In the latter case, special invocation 
of AllocDataSpace with a memory_scope == "texture(:weight)" is required. 
Special memory scopes were added to the runtime in 
https://github.com/apache/tvm/pull/7488. 
   - Workspace allocations are handled via a set of idle texture pools which 
are grown to match the requested sizes. The strategy employed is to first pick 
the pool which requires the least amount of extra space beyond, and then to 
minimize the amount of wasted space that growing a two dimensional pool may 
incur. A similar approach is taken for the ahead of time graph runtime memory 
planner for data space allocations (see: PR#).
   - CopyFromTo support is expanded to handle the case of directly reading from 
/ writing to image buffers from host. 
   
   
   RFC in progress, once posted I will add a link here.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to