janetsc commented on code in PR #13844:
URL: https://github.com/apache/tvm/pull/13844#discussion_r1092261997


##########
src/tir/transforms/lower_async_dma.cc:
##########
@@ -192,19 +198,33 @@ class AsyncDMALowerer : public StmtExprMutator {
       // save queue ID for inspection in `wait` transform
       queue_ids_.insert(queue_id);
 
-      return Evaluate(Call(DataType::Int(32), builtin::dma_copy(),
-                           {queue_id,
-                            Call(DataType::Handle(), builtin::address_of(),
-                                 {BufferLoad(bufferstorenode->buffer, 
store_index)}),
-                            Call(DataType::Handle(), builtin::address_of(),
-                                 {BufferLoad(bufferloadnode->buffer, 
load_index)}),
-                            for_loop->extent * bufferloadnode->dtype.bytes(), 
dma_bypass_cache_}));
+      auto call_dma_copy =
+          Evaluate(Call(DataType::Int(32), builtin::dma_copy(),
+                        {queue_id,
+                         Call(DataType::Handle(), builtin::address_of(),
+                              {BufferLoad(bufferstorenode->buffer, 
store_index)}),
+                         Call(DataType::Handle(), builtin::address_of(),
+                              {BufferLoad(bufferloadnode->buffer, 
load_index)}),
+                         for_loop->extent * bufferloadnode->dtype.bytes(), 
dma_bypass_cache_}));
+
+      // if the buffer we are about to DMA was modified by the primfunc
+      // then we need to flush the buffer from the cache prior to the DMA

Review Comment:
   So the next PR could have:
   
   1.  Unit tests that expose the root of the problem (if they don't, we should 
go back to the drawing board)
   2.  Additions to the matmul test that expose the need to invalidate in the 
VTCM to DDR direction
   3. Changes to insert flush in the right place and invalidate in the right 
place.
   
   The main reason I was advocating to do both directions now is that I'd like 
to confirm that those two unit tests do indeed fail.  If they don't, we'll need 
to do more experiments to get to the root of it.  The unit tests (from our 
offline discussion):
   
   DDR to VTCM
       Allocate two buffers in DDR and VTCM
       Write "0xbeefbeef" to the DDR buffer.  Flush.
       Write "0xfaceface" to the DDR buffer.  Do NOT flush.
       DMA from DDR to VTCM with BYPASS ON.
   Compare DDR and VTCM buffers.  They should not match, because the real 
values you wanted in DDR weren't flushed to DDR.
   
   VTCM to DDR
       Allocate two buffers in DDR and VTCM
       Write "0xbeefbeef" to the DDR buffer.  (You can flush or not - it 
actually won't matter.)
       Write "0xfaceface" to the VTCM buffer.
       DMA from VTCM to DDR with BYPASS ON.
   Compare DDR and VTCM buffers.  They will not match because there were stale 
values in the L2 cache for the DDR buffer, so it won't pick up what you wrote 
with BYPASS ON.  DDR needed to be invalidated first.
   
   All that said, with a flush-only operation for DMA, I'll sign off on this as 
a good incremental step.  Thanks!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to