Lunderberg commented on a change in pull request #39:
URL: https://github.com/apache/tvm-rfcs/pull/39#discussion_r745735959



##########
File path: rfcs/0039-buffer-physical-layout.md
##########
@@ -0,0 +1,640 @@
+- Feature Name: Buffer Physical Layout
+- Authors: Eric Lunderberg (@Lunderberg), Wuwei Lin (@vinx13)
+- Start Date: 2021-10-05
+- RFC PR: [apache/tvm-rfcs#0039](https://github.com/apache/tvm-rfcs/pull/0039)
+- GitHub Issue: Not Yet Written
+- Related RFCs: [RFC#0042](https://github.com/apache/tvm-rfcs/pull/0042)
+
+# Summary
+[summary]: #summary
+
+This RFC introduces layout transformations that can be applied to a
+buffer during the lowering process.  These transformations will be
+part of the schedule, allowing the same compute definition to be used
+across multiple different layouts.
+
+[RFC#0042](https://github.com/apache/tvm-rfcs/pull/0042) is intended
+to make these buffer transformations easier to write, though it isn't
+strictly necessary for this change.
+
+# Motivation
+[motivation]: #motivation
+
+Currently, TVM assumes that all buffers can be treated as flat memory.
+That is, while a tensor may have N dimensions, the underlying buffer
+allocated by the low-level codegen has a single value defining the
+size, and access into that buffer is done using a single index.  This
+assumption holds for most cases, such as a CPU accessing RAM, but
+doesn't hold in all cases.  For example, texture memory on a GPU
+requires two indices to access.
+
+In addition, computations that are semantically identical (e.g. 2-d
+convolution) require independent compute definitions and schedules
+(e.g. `conv2d_nchw` and `conv2d_hwcn`) based on the format of the data
+accepted as input.
+
+This RFC introduces a mechanism to specify transformations to be
+applied to the layout of buffers in memory, including the option to
+present multi-dimensional indices to the low-level code generators.
+This will allow for target-specific handling of non-flat memory, and
+will allow for code re-use across compute definitions that differ only
+in memory layout.
+
+# Guide-level explanation
+[guide-level-explanation]: #guide-level-explanation
+
+A buffer is represented by a `tvm::tir::Buffer` object, and has some
+shape associated with it.  This shape is initially defined from the
+buffer's shape in the compute definition.  Buffers can either be
+allocated within a `tvm::tir::PrimFunc` using a `tvm::tir::Allocate`
+node, or can be passed in as parameters to a `PrimFunc`.  Buffer
+access is done using `tvm::tir::BufferLoad` and
+`tvm::tir::BufferStore` for reads and writes, respectively.
+
+When a TIR graph is passed into the low-level code generator
+`tvm::codegen::Build`, the dimensionality of each buffer must be
+supported by the target code generator.  Typically, this will mean
+generating a 1-dimensional index representing access into flat memory.
+Some code generators may attach alternative semantics for
+multi-dimensional buffers (e.g. 2-d buffers to represent texture
+memory on OpenCL).  A low-level code generator should check the
+dimensionality of the buffers it is acting on, and give a diagnostic
+error for unsupported dimensionality.
+
+To define the layout transformation in a TE schedule, use the
+`transform_layout` method of a schedule, as shown below.  The
+argument to `transform_layout` is a function that accepts a list of
+`tvm.tir.Var` representing a logical index, and outputs a list of
+`tvm.tir.PrimExpr` giving a corresponding physical index.  If
+`transform_layout` isn't called, then no additional layout
+transformations are applied.
+
+For example, the schedule below defines the reordering from NHWC logical layout to
+NCHWc physical layout.  Similar to `cache_read` and `cache_write`, the
+`transform_layout` method introduces a new stage in the schedule.
+
+```python
+# Compute definition, written in terms of NHWC logical axes
+B = te.compute(A.shape, lambda n,h,w,c: A[n,h,w,c])
+s = te.create_schedule(B.op)
+
+def nhwc_to_nchwc(logical_axes):
+    n,h,w,c = logical_axes
+    return [n, c//4, h, w, c%4]
+
+B_nchwc = s[B].transform_layout(nhwc_to_nchwc)
+
+# Compute definition that would produce an equivalent physical layout
+B_equivalent = te.compute(
+    [A.shape[0], A.shape[3]//4, A.shape[1], A.shape[2], 4],
+    lambda n, c_outer, h, w, c_inner: A[n, h, w, 4*c_outer+c_inner],
+)
+```
+
+By default, after all explicitly specified layout transformations are
+applied, all axes are flattened to a single axis by following a
+row-major traversal.  This produces a 1-d buffer, which corresponds to
+flat memory.  To add additional dimensions in the physical layout,
+insert `te.AXIS_SEPARATOR` into the axis list returned by the physical
+layout function.  These define groups of axes, where each group is
+combined into a single physical axis.
+
+```python
+B = te.compute(shape=(M,N,P,Q), ...)
+s = te.create_schedule(B.op)
+
+# Default, produces a 1-d allocation with shape (M*N*P*Q,)
+s[B].transform_layout(lambda m,n,p,q: [m,n,p,q])
+
+# One separator, produces a 2-d allocation with shape (M*N, P*Q).
+s[B].transform_layout(lambda m,n,p,q: [m, n, te.AXIS_SEPARATOR, p, q])
+
+# Two separators, produces a 3-d allocation with shape (M, N*P, Q).
+s[B].transform_layout(lambda m,n,p,q: [m, te.AXIS_SEPARATOR, n, p, te.AXIS_SEPARATOR, q])
+
+# Can be used along with reorders and splits.
+s[B].transform_layout(lambda m,n,p,q: [m, q//4, n, te.AXIS_SEPARATOR, p, q%4])
+```
+
+
+The `te.AXIS_SEPARATOR` object exists only within the API interface,
+and is not part of the representation of the layout transformation
+within the generated TIR graph.  Instead, the TIR graph will contain
+an integer list of axis separators, to be used when flattening buffers
+to device-supported dimensions in the `StorageFlatten` or
+`FlattenBuffer` passes.
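+To make this bookkeeping concrete, the translation from the user-facing
+axis list into the integer separator list can be sketched in plain
+Python.  The helper name `split_axis_list` and the sentinel object are
+illustrative assumptions, not the actual TVM API; the convention shown
+(a separator records the index of the last axis in its group) matches
+the attribute values used in the examples later in this RFC.

```python
# Stand-in for te.AXIS_SEPARATOR: any unique sentinel object works here.
AXIS_SEPARATOR = object()

def split_axis_list(axes):
    """Split a user-facing axis list into (axes, separator positions).

    Each separator records the index of the last axis in its group,
    which is the integer form stored in the TIR attributes.
    """
    exprs, separators = [], []
    for item in axes:
        if item is AXIS_SEPARATOR:
            separators.append(len(exprs) - 1)
        else:
            exprs.append(item)
    return exprs, separators

# [n, c//4, h, SEPARATOR, w, c%4] yields the separator list [2]
axes, seps = split_axis_list(["n", "c_outer", "h", AXIS_SEPARATOR, "w", "c_inner"])
```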
+
+
+# Reference-level explanation
+[reference-level-explanation]: #reference-level-explanation
+
+Transformation of a buffer is represented by the attribute
+`"buffer_layout_transformations"` in the `PrimFunc` attributes.  This
+is a map whose keys are the buffer vars to be reshaped, and whose values
+are the transformations to be applied.  Many of the utilities
+needed for this transformation already exist in `iter_affine_map.h`,
+and are used in the implementation.
+
+A buffer may be allocated with `AllocateNode`, and may be interacted
+with using `BufferLoadNode` and `BufferStoreNode`.
+`BufferRealizeNode` should only appear in TE-based schedules, and
+should be converted to `AllocateNode`.  `LoadNode` and `StoreNode`
+are deprecated.
+
+## Impacted TIR Nodes
+
+- BufferNode
+  - Describes an N-d buffer.  The layout of the buffer may be
+    rewritten by the transformations introduced in this RFC.
+
+- BufferRealizeNode
+  - Realization of a buffer, in logical layout.
+  - For external buffers, serves as an optional annotation.  For
+    internal buffers, results in allocation of memory.
+
+
+- BufferLoadNode/BufferStoreNode
+  - Read/write of a buffer.
+
+  - Change from previous behavior: Will exist throughout the lowering
+    process, and will be passed to the low-level code generators.
+    Transformations that previously created `Load` and `Store` nodes
+    will instead create `BufferLoad` and `BufferStore` nodes with 1-d
+    indices.
+
+
+
+- AllocateNode
+  - Declares an allocation of a buffer, in physical layout.
+
+  - Change from previous behavior: Previously, `AllocateNode` held the
+    `buffer_var`, datatype, and buffer extents directly. After
+    implementation of this RFC, `AllocateNode` will instead hold the
+    `Buffer` that is to be allocated.
+
+
+- LoadNode/StoreNode
+  - Read/write of a 1-d buffer, given a `Var` pointer to the start of
+    the buffer and a single index.
+
+  - Deprecated, should instead use `BufferLoad` and `BufferStore` with
+    a 1-d index.
+
+
+## Impacted tir Transformations
+
+- `ApplyBufferTransforms`
+  - A new pass that takes as input a TIR graph that may have buffer
+    transformations stored in the `PrimFunc` attributes.  Returns
+    a TIR graph with all buffer transforms applied as specified.
+
+  - Rewrite `indices` in BufferStore/BufferLoad nodes based on the
+    specified transformation.
+
+  - The transformations are stored as a `Map<Var, Array<IndexMap>>` in
+    the `"buffer_layout_transformations"` attribute of a `PrimFunc`.
+    All buffers whose `BufferNode::data` is a key in this map should
+    have their physical layout rewritten.  If the array contains
+    multiple transformations, they are applied sequentially.
+
+    A possible structure for the `IndexMap` node is shown
+    below.
+
+    ```
+    class IndexMapNode : public Object {
+    public:
+      /*! \brief Variables representing the indices prior to remapping.
+       *
+       * If initial_index is empty, then final_index should also be
+       * empty, and no mapping is applied.
+       */
+      Array<Var> initial_index;
+
+      /*!
+       * \brief Expressions defining the indices after remapping.
+       *
+       * These expressions should only be in terms of the initial_index,
+       * and must be expressible as a `tvm::arith::IterSumExpr`.  The
+       * mapping from `initial_index` to `final_index` must be injective.
+       *
+       * If final_index is empty, then initial_index should also be
+       * empty, and the map is an identity function.
+       */
+      Array<PrimExpr> final_index;
+    };
+    ```
+
+  - After applying the transformations, the
+    `"buffer_layout_transformations"` attribute should be removed.
+    This ensures that additional application of
+    `ApplyBufferTransforms` has no effect.
+
+- FlattenBuffer/StorageFlatten
+
+  - Existing passes that convert from logical layout to physical
+    layout for TE schedules (StorageFlatten) or TensorIR schedules
+    (FlattenBuffer).
+
+  - The transformations are stored as a `Map<Var, Array<IntImm>>` in
+    the `"buffer_axis_separators"` attribute of a `PrimFunc`.  All
+    buffers whose `BufferNode::data` is a key in this map should be
+    flattened to an output buffer of dimension
+    `separators[buf->data].size()+1`.  All other buffers should be
+    flattened to a 1-d output buffer.
+
+  - After flattening a buffer to an N-d output, the corresponding
+    value in the `"buffer_axis_separators"` attribute should be set to
+    `range(N-1)`.  This ensures that repeated application of the
+    flattening passes have no additional effect.  (The attribute
+    shouldn't be deleted entirely, as that would cause a flattened
+    buffer with `N` dimensions and an unflattened buffer with `N`
+    dimensions to have identical representations.)
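+As a rough model of how `ApplyBufferTransforms` rewrites indices, each
+entry in the transformation map can be viewed as a function from old
+indices to new indices, applied in sequence.  The snippet below is an
+illustrative sketch in plain Python, modeling each `IndexMap` as a
+callable rather than as the `(initial_index, final_index)` substitution
+used in TIR; the numbers reproduce the NCHWc example in the next section.

```python
def apply_index_maps(indices, index_maps):
    """Apply a sequence of index maps (earliest first) to buffer indices.

    Each map is modeled as a plain function from a list of logical
    indices to a list of remapped indices; TIR instead represents a map
    as (initial_index, final_index) and substitutes variables.
    """
    for index_map in index_maps:
        indices = list(index_map(indices))
    return indices

# NHWC -> NCHWc, matching the transform_layout examples in this RFC.
def nhwc_to_nchwc(idx):
    n, h, w, c = idx
    return [n, c // 4, h, w, c % 4]

remapped = apply_index_maps([11, 37, 23, 101], [nhwc_to_nchwc])
# remapped == [11, 25, 37, 23, 1]
```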
+
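+The row-major flattening performed by StorageFlatten and FlattenBuffer,
+grouped by axis separators, can likewise be sketched as follows.  This
+is an illustrative model, not the pass implementation; it uses the same
+separator convention as the attributes above (a separator value `s`
+closes a group after axis `s`), so an empty separator list yields the
+default 1-d flattening.

```python
def flatten_buffer(shape, indices, separators):
    """Row-major flattening, one output axis per separator-delimited group.

    A separator value s closes a group after axis s; with no separators,
    the result is the default flat 1-d buffer.
    """
    boundaries = [s + 1 for s in separators] + [len(shape)]
    out_shape, out_indices, start = [], [], 0
    for stop in boundaries:
        extent, flat = 1, 0
        for dim, idx in zip(shape[start:stop], indices[start:stop]):
            extent *= dim            # accumulate the group's total extent
            flat = flat * dim + idx  # row-major index within the group
        out_shape.append(extent)
        out_indices.append(flat)
        start = stop
    return out_shape, out_indices

# Default 1-d flattening of a [2, 3] buffer: index [1, 2] -> [1*3 + 2]
flatten_buffer([2, 3], [1, 2], [])
# ([6], [5])

# 2-d texture-style flattening, with a separator closing the group after axis 2
flatten_buffer([16, 32, 64, 64, 4], [11, 25, 37, 23, 1], [2])
# ([32768, 256], [24165, 93])
```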
+
+## Examples
+
+The following are intended as pseudo-code, and exclude details not
+relevant to this RFC (e.g. dtype).  These do not correspond with the
+final version of TensorIR that implements this RFC.  Numeric values
+are shown unsimplified to indicate where they come from.
+
+The first example shows a 2-d buffer with no layout transformations
+explicitly specified.  The generated `PrimFunc` has no
+`"buffer_layout_transformations"` attribute, and so the default
+behavior is used, applying a row-major traversal to generate a flat
+1-d buffer.
+
+```python
+# In TE schedule, no call to transform_layout.
+
+# Initial TIR graph
+x = Buffer(name="x", shape=[2,3])
+with Allocate(x):
+    val = BufferLoad(x, [10, 15])
+    BufferStore(x, 7, [20, 23])
+
+# After flattening to 1-d
+x = Var(name="x")
+with Allocate(x, shape=[2*3]):
+    val = BufferLoad(x, [10*3 + 15])
+    BufferStore(x, 7, [20*3 + 23])
+```
+
+This next example shows a 2-d logical buffer, which is lowered to a
+1-d physical buffer.  `transform_layout` has been used to define a
+physical layout whose fastest changing dimension corresponds to the
+first index in the logical layout.
+
+```python
+# In TE schedule
+# s[x].transform_layout(lambda i,j: [j,i])
+
+# Initial TIR graph
+attrs["buffer_layout_transformations"][x] = lambda i,j: [j,i]
+x = Buffer(name="x", shape=[2,3])
+with Allocate(x):
+    val = BufferLoad(x, [10, 15])
+    BufferStore(x, 7, [20, 23])
+
+# After applying the explicit reordering
+x = Buffer(name="x", shape=[3,2])
+with Allocate(x):
+    val = BufferLoad(x, [15, 10])
+    BufferStore(x, 7, [23, 20])
+
+# After flattening to 1-d
+x = Var(name="x")
+with Allocate(x, shape=[3*2]):
+    val = BufferLoad(x, [15*2 + 10])
+    BufferStore(x, 7, [23*2 + 20])
+```
+
+The next example shows a remapping from NHWC logical layout to NCHWc
+physical layout.  The 4 logical axes are expanded to 5 logical axes
+during the `ApplyBufferTransforms` pass, then flattened into 1 physical
+axis during StorageFlatten/FlattenBuffer.
+
+```python
+# In TE schedule
+# s[x].transform_layout(lambda n,h,w,c: [n, c//4, h, w, c%4])
+
+# Initial TIR graph
+attrs["buffer_layout_transformations"][x] = lambda n,h,w,c: [n, c//4, h, w, c%4]
+x = Buffer(name="x", shape=[16,64,64,128], reorder_splits=nhwc_to_nchwc, axis_separators=[])
+with Allocate(x):
+    val = BufferLoad(x, [11, 37, 23, 101])
+
+# After applying the explicit reordering
+x = Buffer(name="x", shape=[16, 128/4, 64, 64, 4], reorder_splits=[], axis_separators=[])
+with Allocate(x):
+    val = BufferLoad(x, index=[11, floor(101/4), 37, 23, 101%4])
+
+# After flattening to 1-d
+x = Var(name="x")
+with Allocate(x, shape=[16 * (128/4) * 64 * 64 * 4]):
+    val = BufferLoad(x, index=[(128/4)*64*64*4*11 + 64*64*4*floor(101/4) + 64*4*37 + 4*23 + 101%4])
+```
+
+Lastly, an example of remapping from `NHWC` logical layout to `NCHWc`
+physical layout, packed into a 2-d physical layout with `NCH` in the
+first physical axis and `Wc` in the second physical axis.  This is the
+layout used by the current `"global.texture"` definition for
+texture memory.  The reordering applied during `ApplyBufferTransforms` is
+identical to the previous example, but StorageFlatten produces a 2-d
+physical index.  The interpretation of this 2-d index depends on the
+target-specific codegen.
+
+```python
+# In TE schedule
+# s[x].transform_layout(lambda n,h,w,c: [n, c//4, h, te.AXIS_SEPARATOR, w, c%4])
+
+# Initial TIR graph
+attrs["buffer_layout_transformations"][x] = lambda n,h,w,c: [n, c//4, h, w, c%4]
+attrs["buffer_axis_separators"][x] = [2]
+x = Buffer(name="x", shape=[16,64,64,128])
+with Allocate(x):
+    val = BufferLoad(x, [11, 37, 23, 101])
+
+# After applying the explicit reordering.
+attrs["buffer_axis_separators"][x] = [2]
+x = Buffer(name="x", shape=[16, 128/4, 64, 64, 4])
+with Allocate(x):
+    val = BufferLoad(x, index=[11, floor(101/4), 37, 23, 101%4])
+
+# After applying StorageFlatten or FlattenBuffer.  The final result is
+# 2-d, due to the te.AXIS_SEPARATOR used in the `.transform_layout`.
+# The `"buffer_axis_separators"` attribute is set to [0], to
+# distinguish this 2-d flattened buffer from a 2-d unflattened buffer.
+attrs["buffer_axis_separators"][x] = [0]
+x = Var(name="x")
+with Allocate(x, shape=[16 * (128/4) * 64, 64 * 4]):
+    val = BufferLoad(x, index=[(128/4)*64*11 + 64*floor(101/4) + 37, 4*23 + 101%4])
+```
+
+
+
+# Drawbacks
+[drawbacks]: #drawbacks
+
+This change may make it more difficult to reason about the memory

Review comment:
       Agreed entirely.  My goal has been to have the default behavior, both 
for TE and for TIR, remain the same.  The differences in memory access patterns 
would only occur when explicitly opted into, with `Schedule.transform_layout` 
in TE, or with the annotations in TIR.



