mitiskuma commented on code in PR #18877:
URL: https://github.com/apache/tvm/pull/18877#discussion_r2893181418


##########
src/runtime/metal/metal_common.h:
##########
@@ -103,13 +103,34 @@ class AutoReleasePoolWrapper {
 };
 
 /*!
- * \brief Structure for error handling in queues
+ * \brief Metal command stream with batched dispatch support.
+ *
+ * Compute dispatches are batched into a single command buffer via
+ * GetPendingComputeEncoder(). Blit operations (copies) are interleaved
+ * on the same command buffer via GetBlitEncoderOnPendingBuffer().
+ * The command buffer is committed when FlushCommandBuffer() is called.
+ *
+ * Must call FlushCommandBuffer() before:
+ * - GPU→CPU readback (need data in CPU memory)
+ * - Buffer deallocation (FreeDataSpace, setPurgeableState:Empty on
+ *   a buffer referenced by an uncommitted CB crashes Metal)
+ * - Stream sync (StreamSync / Synchronize)
  */
 class Stream {
  public:
   explicit Stream(id<MTLDevice> device) { queue_ = [device newCommandQueue]; }
-  ~Stream() { [queue_ release]; }
-  id<MTLCommandBuffer> GetCommandBuffer(std::string label = "", bool 
attach_error_callback = true) {
+  ~Stream() {
+    FlushCommandBuffer();
+    [queue_ release];
+  }

Review Comment:
   The Stream destructor is only called during MetalWorkspace teardown (process 
exit or ReinitializeDefaultStreams). At that point no GPU work is in flight. 
FlushCommandBuffer commits the CB, but we do not need to wait for completion: 
the process is tearing down, and the completion handler captures `this`, which 
is about to be freed regardless. Using Synchronize() here would block the main 
thread waiting for the GPU to drain for no benefit. The original code (before 
this PR) also did not wait for completion during destruction.



##########
src/runtime/metal/metal_common.h:
##########
@@ -133,8 +247,42 @@ class Stream {
   const std::string& ErrorDescription() const { return error_description_; }
 
  private:
+  /*! \brief Get or create the pending command buffer (shared by compute and 
blit). */
+  id<MTLCommandBuffer> GetOrCreatePendingCommandBuffer() {
+    if (pending_command_buffer_ == nil) {
+      pending_command_buffer_ = [[queue_ commandBuffer] retain];
+      pending_command_buffer_.label = @"TVMBatched";
+      [pending_command_buffer_ addCompletedHandler:^(id<MTLCommandBuffer> 
buffer) {
+        if (buffer.status == MTLCommandBufferStatusError) {
+          TVM_FFI_ICHECK(buffer.error != nil);

Review Comment:
   TVM_FFI_ICHECK is the standard error reporting mechanism used throughout the 
TVM Metal runtime. The pre-existing code (GetCommandBuffer, which this PR does 
not change) already uses TVM_FFI_ICHECK in the same completion handler pattern. 
This is not a regression introduced by this PR.



##########
src/runtime/metal/metal_common.h:
##########
@@ -201,22 +349,65 @@ class MetalThreadEntry {
   Device device;
   /*! \brief The current stream */
   std::vector<TVMStreamHandle> stream;
-  /*! \brief The shared buffer used for copy. */
+  /*! \brief The shared buffer used for GPU→CPU readback. */
   std::vector<id<MTLBuffer>> temp_buffer_;
+  /*!
+   * \brief Pool of staging buffers for CPU→GPU copies that are inlined
+   * into the pending command buffer. Each inlined copy needs its own
+   * staging buffer because the GPU reads them asynchronously.
+   * Buffers are recycled after FlushCommandBuffer()/Synchronize().
+   */
+  struct StagingBufferPool {
+    struct Entry {
+      id<MTLBuffer> buffer = nil;
+      size_t size = 0;
+    };
+    std::vector<Entry> pool;
+    size_t next_index = 0;  // round-robin within current batch
+
+    id<MTLBuffer> GetOrCreate(id<MTLDevice> dev, size_t nbytes) {
+      if (next_index < pool.size() && pool[next_index].size >= nbytes) {
+        return pool[next_index++].buffer;
+      }
+      // Need a new or bigger buffer at this index
+      if (next_index < pool.size() && pool[next_index].buffer != nil) {
+        [pool[next_index].buffer release];
+      }
+      if (next_index >= pool.size()) {
+        pool.push_back({nil, 0});
+      }
+      pool[next_index].buffer = [dev newBufferWithLength:nbytes 
options:MTLStorageModeShared];
+      pool[next_index].size = nbytes;
+      return pool[next_index++].buffer;

Review Comment:
   Fair point. Will add a nil check with ICHECK. Note that AllocDataSpace (the 
main allocation path) likewise guards its allocation with only a simple ICHECK, 
so this matches the existing codebase pattern; since adding the check here is 
cheap and reasonable, I will do the same.



##########
src/runtime/metal/metal_common.h:
##########
@@ -201,22 +349,65 @@ class MetalThreadEntry {
   Device device;
   /*! \brief The current stream */
   std::vector<TVMStreamHandle> stream;
-  /*! \brief The shared buffer used for copy. */
+  /*! \brief The shared buffer used for GPU→CPU readback. */
   std::vector<id<MTLBuffer>> temp_buffer_;
+  /*!
+   * \brief Pool of staging buffers for CPU→GPU copies that are inlined
+   * into the pending command buffer. Each inlined copy needs its own
+   * staging buffer because the GPU reads them asynchronously.
+   * Buffers are recycled after FlushCommandBuffer()/Synchronize().
+   */
+  struct StagingBufferPool {
+    struct Entry {
+      id<MTLBuffer> buffer = nil;
+      size_t size = 0;
+    };
+    std::vector<Entry> pool;
+    size_t next_index = 0;  // round-robin within current batch

Review Comment:
   The pool does not grow unboundedly. next_index is reset to 0 on every 
StreamSync call, which happens at least once per token (GPU->CPU readback). In 
practice the pool stabilizes at ~20 entries (one per CPU->GPU copy between 
syncs) and never grows beyond that. The "round-robin" comment is imprecise; I 
will change it to "sequential within current batch".



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to