[GitHub] [tvm] masahi commented on a diff in pull request #12544: [MetaSchedule] Add software pipeline in CUDA tensor core auto tensorization

GitBox Mon, 22 Aug 2022 17:52:49 -0700


masahi commented on code in PR #12544:
URL: https://github.com/apache/tvm/pull/12544#discussion_r952043594



##########
src/meta_schedule/schedule_rule/multi_level_tiling_tensor_core.cc:
##########
@@ -286,6 +293,54 @@ std::vector<State> MultiLevelTilingTensorCoreNode::AddReadReuseTensorCore(
   return {state};
 }
 
+std::vector<State> MultiLevelTilingTensorCoreNode::AddSoftwarePipeline(
+    TensorCoreState state) const {
+  if (!use_software_pipeline) {
+    return {state};
+  }
+  // The current config is not suitable for software pipelining.
+  if (r_indices_.size() < 2) {
+    return {state};
+  }
+
+  Schedule& sch = state->sch;
+  // Check reduction length after blockize.
+  int64_t reduction_length = 1;
+  for (int r_index : r_indices_) {
+    const Array<LoopRV>& tiles = state->tiles[r_index];
+    for (const LoopRV& tile : tiles) {
+      const auto* extent = sch->Get(tile)->extent.as<IntImmNode>();
+      ICHECK(extent != nullptr) << "Dynamic extent is not supported.";
+      reduction_length *= extent->value;
+    }
+  }
+  if (reduction_length <= 1) {
+    return {state};
+  }
+
+  // Add local stage and double buffering
+  for (int i = 0; i < 2; ++i) {
+    const tir::BlockRV cache_read = state->read_reuse.at(i);
+    sch->Annotate(cache_read, tir::attr::manifest_shared_memory_local_stage, Bool(true));
+    sch->Annotate(cache_read, tir::attr::double_buffer_scope, Integer(0));
+  }
+  // Add annotations of software pipeline
+
+  // Inner software pipeline: Prefetch to tensor core fragment by one iteration
+  sch->Annotate(state->tiles[r_indices_[1]].back(), tir::attr::software_pipeline_stage,
+                Array<Integer>{0, 0, 1});
+  sch->Annotate(state->tiles[r_indices_[1]].back(), tir::attr::software_pipeline_order,
+                Array<Integer>{0, 1, 2});
+  // Outer software pipeline: Interleave the outer loop with the (pipelined) inner loop.
+  // The prefetching stage of the inner pipeline is executed by one iteration in the outer loop.
+  sch->Annotate(state->tiles[r_indices_[0]].back(), tir::attr::software_pipeline_stage,
+                Array<Integer>{0, 0, 0, 0, 0, 1, 1});
+  sch->Annotate(state->tiles[r_indices_[0]].back(), tir::attr::software_pipeline_order,
+                Array<Integer>{0, 3, 1, 4, 5, 2, 6});

Review Comment:
   Can we have some visualization of this ordering (like pseudocode)? I
remember that this ordering made sense to me when I was closely examining the
generated IR, but I no longer remember how it works, and it is very hard to
tell what the ordering means just by looking at the numbers.
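
   As a starting point for such a visualization, here is a minimal standalone
sketch that expands a (stage, order) pair into the sequence of statement
instances it implies. It is not TVM code. `Step`, `ExpandPipeline` and
`FormatSteps` are hypothetical helpers, and `S0`..`S6` are placeholder names
for the loop body statements, since the diff does not say what they are. The
sketch assumes that `stage[i]` is the number of iterations by which statement
`i` is delayed and that `order[i]` is the position of statement `i` in the
rewritten loop body. It treats each loop in isolation and does not model the
nesting of the inner pipeline inside the outer one.

```cpp
#include <string>
#include <vector>

// One emitted step: statement `stmt` executing original loop iteration `iter`.
struct Step {
  int stmt;
  int iter;
};

// Expand a (stage, order) annotation pair into a flat execution sequence.
// Assumption: stage[s] delays statement s by that many iterations, and
// order[s] is the position of statement s in the rewritten loop body.
std::vector<Step> ExpandPipeline(const std::vector<int>& stage,
                                 const std::vector<int>& order, int extent) {
  const int n = static_cast<int>(stage.size());
  // Invert `order`: by_position[p] is the statement placed at position p.
  std::vector<int> by_position(n);
  int max_stage = 0;
  for (int s = 0; s < n; ++s) {
    by_position[order[s]] = s;
    if (stage[s] > max_stage) max_stage = stage[s];
  }
  std::vector<Step> steps;
  // extent + max_stage rewritten iterations cover prologue, body and epilogue.
  for (int t = 0; t < extent + max_stage; ++t) {
    for (int p = 0; p < n; ++p) {
      const int s = by_position[p];
      const int iter = t - stage[s];
      // Skip instances that fall outside the original iteration range.
      if (iter >= 0 && iter < extent) steps.push_back({s, iter});
    }
  }
  return steps;
}

// Render steps as "S<stmt>[<iter>]" separated by spaces.
std::string FormatSteps(const std::vector<Step>& steps) {
  std::string out;
  for (const Step& st : steps) {
    if (!out.empty()) out += " ";
    out += "S" + std::to_string(st.stmt) + "[" + std::to_string(st.iter) + "]";
  }
  return out;
}
```

   Under these assumptions, the outer annotation (stage `{0, 0, 0, 0, 0, 1, 1}`,
order `{0, 3, 1, 4, 5, 2, 6}`) gives a steady-state body of
`S0[t] S2[t] S5[t-1] S1[t] S3[t] S4[t] S6[t-1]`. For example,
`FormatSteps(ExpandPipeline({0, 0, 1}, {0, 1, 2}, 3))` yields
`S0[0] S1[0] S0[1] S1[1] S2[0] S0[2] S1[2] S2[1] S2[2]` for the inner loop.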


