adstraw commented on code in PR #13719:
URL: https://github.com/apache/tvm/pull/13719#discussion_r1069928049


##########
python/tvm/tir/tensor_intrin/hexagon.py:
##########
@@ -20,6 +20,47 @@
 from .. import TensorIntrin
 
 
+def generate_dma_load_intrin(
+    size: int,
+    dtype: str,
+):
+    """Generator of dma_load intrins"""
+
+    @T.prim_func
+    def dma_load_desc(a: T.handle, c: T.handle) -> None:
+        A = T.match_buffer(a, (size), dtype, offset_factor=1, scope="global")
+        C = T.match_buffer(c, (size), dtype, offset_factor=1, scope="global.vtcm")
+        with T.block("root"):
+            T.reads(A[0:size])
+            T.writes(C[0:size])
+            for i in T.serial(size):
+                with T.block("load"):
+                    vii = T.axis.remap("S", [i])
+                    C[vii] = A[vii]
+
+    @T.prim_func
+    def dma_load_impl(a: T.handle, c: T.handle) -> None:
+        A = T.match_buffer(a, (size), dtype, offset_factor=1, scope="global")
+        C = T.match_buffer(c, (size), dtype, offset_factor=1, scope="global.vtcm")
+        with T.block("root"):
+            T.reads(A[0:size])
+            T.writes(C[0:size])
+            T.evaluate(
+                T.tvm_call_packed(
+                    "device_api.hexagon.dma_copy",
+                    0,

Review Comment:
   Synchronous DMA uses queue ID -1.  See [here](https://github.com/apache/tvm/blob/60c723ec267ee5095a35add5f8259e650b8ddd7b/src/runtime/hexagon/hexagon_user_dma.h#L37).  This is so as not to interfere with the async DMA flow, which uses queue IDs starting at 0.  Please use queue -1 and add some comments here.
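The queue-ID convention the comment refers to can be sketched in plain Python (a hedged illustration; `SYNC_DMA_QUEUE` and `dma_queue_id` are illustrative names, not part of the TVM runtime):

```python
# Illustrative sketch of the queue-ID convention described above:
# synchronous DMA uses the reserved queue ID -1 (see hexagon_user_dma.h),
# while async DMA allocates queue IDs starting at 0, so the two flows
# never share a queue.
SYNC_DMA_QUEUE = -1  # reserved for the synchronous (copy + immediate wait) flow

def dma_queue_id(is_async: bool, async_index: int = 0) -> int:
    """Pick a DMA queue ID under the convention above (illustrative helper)."""
    if is_async:
        assert async_index >= 0, "async DMA queues start at 0"
        return async_index
    return SYNC_DMA_QUEUE
```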



##########
python/tvm/tir/tensor_intrin/hexagon.py:
##########
@@ -20,6 +20,47 @@
 from .. import TensorIntrin
 
 
+def generate_dma_load_intrin(
+    size: int,
+    dtype: str,
+):
+    """Generator of dma_load intrins"""
+
+    @T.prim_func
+    def dma_load_desc(a: T.handle, c: T.handle) -> None:
+        A = T.match_buffer(a, (size), dtype, offset_factor=1, scope="global")
+        C = T.match_buffer(c, (size), dtype, offset_factor=1, scope="global.vtcm")
+        with T.block("root"):
+            T.reads(A[0:size])
+            T.writes(C[0:size])
+            for i in T.serial(size):
+                with T.block("load"):
+                    vii = T.axis.remap("S", [i])
+                    C[vii] = A[vii]
+
+    @T.prim_func
+    def dma_load_impl(a: T.handle, c: T.handle) -> None:

Review Comment:
   sync_dma_load_impl



##########
tests/python/contrib/test_hexagon/test_vtcm_bandwidth.py:
##########
@@ -123,15 +125,18 @@ class TestMatMulVec:
 
     # Removed most of these to speedup CI.
     size = tvm.testing.parameter(
-        # 10 * KB,
+        128,
+        256,
+        1024,
+        10 * KB,
         # 20 * KB,
         # 40 * KB,
         # 80 * KB,
         # 160 * KB,
         # 320 * KB,
         640 * KB,
         # MB,
-        # 2 * MB,

Review Comment:
   Did you mean to uncomment this?  Makes the test run longer in CI.



##########
tests/python/contrib/test_hexagon/test_vtcm_bandwidth.py:
##########
@@ -104,8 +106,8 @@ def evaluate(hexagon_session, sch, size):
     )
 
     # These are reduced for CI but number=100 and repeat=10 does a good job of removing noise.
-    number = 1
-    repeat = 1
+    number = 10
+    repeat = 10

Review Comment:
   Did you mean to change this?  Makes the test run longer in CI.



##########
python/tvm/tir/tensor_intrin/hexagon.py:
##########
@@ -20,6 +20,47 @@
 from .. import TensorIntrin
 
 
+def generate_dma_load_intrin(
+    size: int,
+    dtype: str,
+):
+    """Generator of dma_load intrins"""
+
+    @T.prim_func
+    def dma_load_desc(a: T.handle, c: T.handle) -> None:
+        A = T.match_buffer(a, (size), dtype, offset_factor=1, scope="global")
+        C = T.match_buffer(c, (size), dtype, offset_factor=1, scope="global.vtcm")
+        with T.block("root"):
+            T.reads(A[0:size])
+            T.writes(C[0:size])
+            for i in T.serial(size):
+                with T.block("load"):
+                    vii = T.axis.remap("S", [i])
+                    C[vii] = A[vii]
+
+    @T.prim_func
+    def dma_load_impl(a: T.handle, c: T.handle) -> None:
+        A = T.match_buffer(a, (size), dtype, offset_factor=1, scope="global")
+        C = T.match_buffer(c, (size), dtype, offset_factor=1, scope="global.vtcm")
+        with T.block("root"):
+            T.reads(A[0:size])
+            T.writes(C[0:size])
+            T.evaluate(
+                T.tvm_call_packed(
+                    "device_api.hexagon.dma_copy",
+                    0,
+                    T.address_of(C[0], dtype="handle"),
+                    T.address_of(A[0], dtype="handle"),
+                    size,
+                    0,
+                    dtype="int32",
+                )
+            )
+            T.evaluate(T.tvm_call_packed("device_api.hexagon.dma_wait", 0, 0, dtype="int32"))

Review Comment:
   Queue = -1.  Add comments noting that Wait(queue, 0) means waiting for the queue to drain, which is the sum total of the previous dma_copy.
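The copy/wait pairing the comment describes can be modeled with a toy queue (class and method names are illustrative, not the real Hexagon runtime API): waiting with a threshold of 0 drains the queue, which in the synchronous flow amounts to completing the single preceding dma_copy.

```python
# Toy model of the synchronous DMA flow: each copy() adds one in-flight
# transfer; wait(max_inflight) models blocking until at most that many
# transfers remain, so wait(0) drains the queue entirely.
class ToyDMAQueue:
    def __init__(self, queue_id: int = -1):  # -1: reserved sync queue ID
        self.queue_id = queue_id
        self.inflight = 0

    def copy(self) -> None:
        self.inflight += 1  # enqueue one DMA transfer

    def wait(self, max_inflight: int) -> None:
        # Model completion of transfers until at most max_inflight remain.
        self.inflight = min(self.inflight, max_inflight)

# Synchronous flow: one copy followed immediately by a full drain.
q = ToyDMAQueue()
q.copy()
q.wait(0)  # queue drained: the preceding copy is complete
```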



##########
python/tvm/tir/tensor_intrin/hexagon.py:
##########
@@ -20,6 +20,47 @@
 from .. import TensorIntrin
 
 
+def generate_dma_load_intrin(
+    size: int,
+    dtype: str,
+):
+    """Generator of dma_load intrins"""
+
+    @T.prim_func
+    def dma_load_desc(a: T.handle, c: T.handle) -> None:

Review Comment:
   Would like this to be called "sync_dma_load_desc" with some comments to distinguish between async and sync (copy and immediate wait) flow.



##########
python/tvm/tir/tensor_intrin/hexagon.py:
##########
@@ -20,6 +20,47 @@
 from .. import TensorIntrin
 
 
+def generate_dma_load_intrin(
+    size: int,
+    dtype: str,
+):
+    """Generator of dma_load intrins"""
+
+    @T.prim_func
+    def dma_load_desc(a: T.handle, c: T.handle) -> None:
+        A = T.match_buffer(a, (size), dtype, offset_factor=1, scope="global")
+        C = T.match_buffer(c, (size), dtype, offset_factor=1, scope="global.vtcm")
+        with T.block("root"):
+            T.reads(A[0:size])
+            T.writes(C[0:size])
+            for i in T.serial(size):
+                with T.block("load"):
+                    vii = T.axis.remap("S", [i])
+                    C[vii] = A[vii]
+
+    @T.prim_func
+    def dma_load_impl(a: T.handle, c: T.handle) -> None:
+        A = T.match_buffer(a, (size), dtype, offset_factor=1, scope="global")
+        C = T.match_buffer(c, (size), dtype, offset_factor=1, scope="global.vtcm")
+        with T.block("root"):
+            T.reads(A[0:size])
+            T.writes(C[0:size])
+            T.evaluate(
+                T.tvm_call_packed(
+                    "device_api.hexagon.dma_copy",
+                    0,
+                    T.address_of(C[0], dtype="handle"),
+                    T.address_of(A[0], dtype="handle"),
+                    size,
+                    0,

Review Comment:
   Need comments, at least to indicate that this is for bypass.  Better would be to tie the setting of this bit to the `tir.experimental_dma_bypass_cache` annotation.
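One way to act on this suggestion is to make the bypass bit an explicit argument instead of a hard-coded constant. A hedged sketch follows; the argument order mirrors the packed call in the diff, but `dma_copy_args`, `bypass_cache`, and the assumption that 1 means "bypass" are all hypothetical:

```python
# Hypothetical helper: build the argument tuple for the dma_copy packed
# call with the cache-bypass bit made explicit rather than hard-coded,
# so it could later be driven by the tir.experimental_dma_bypass_cache
# annotation.
def dma_copy_args(dst_ptr, src_ptr, size, bypass_cache: bool):
    queue_id = -1  # synchronous DMA uses the reserved queue -1
    return (
        "device_api.hexagon.dma_copy",
        queue_id,
        dst_ptr,
        src_ptr,
        size,
        int(bypass_cache),  # assumed: 1 = bypass cache, 0 = cached
    )
```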



##########
python/tvm/tir/tensor_intrin/hexagon.py:
##########
@@ -163,3 +204,27 @@ def dot_product_32x2_i16i16i32_vdmpy(a: T.handle, b: T.handle, c: T.handle) -> N
 
 VRMPY_u8i8i32_VTCM_INTRIN = "dot_32x4_u8i8i32_vtcm_vrmpy"
 TensorIntrin.register(VRMPY_u8i8i32_VTCM_INTRIN, *generate_dot_product_32x4_u8i8i32("global.vtcm"))
+
+DMA_READ_1_u8 = "dma_read_1_u8"

Review Comment:
   I don't see users for most of these.  Seems like it might be better to delete and allow users to create what is needed based on the test case or schedule?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to