Re: [PR] [Metal] Batched command dispatch and staging buffer pool [tvm]

via GitHub Thu, 05 Mar 2026 17:01:51 -0800


gemini-code-assist[bot] commented on code in PR #18877:
URL: https://github.com/apache/tvm/pull/18877#discussion_r2893170186



##########
src/runtime/metal/metal_common.h:
##########
@@ -103,13 +103,34 @@ class AutoReleasePoolWrapper {
 };
 
 /*!
- * \brief Structure for error handling in queues
+ * \brief Metal command stream with batched dispatch support.
+ *
+ * Compute dispatches are batched into a single command buffer via
+ * GetPendingComputeEncoder(). Blit operations (copies) are interleaved
+ * on the same command buffer via GetBlitEncoderOnPendingBuffer().
+ * The command buffer is committed when FlushCommandBuffer() is called.
+ *
+ * Must call FlushCommandBuffer() before:
+ * - GPU→CPU readback (need data in CPU memory)
+ * - Buffer deallocation (FreeDataSpace, setPurgeableState:Empty on
+ *   a buffer referenced by an uncommitted CB crashes Metal)
+ * - Stream sync (StreamSync / Synchronize)
  */
 class Stream {
  public:
   explicit Stream(id<MTLDevice> device) { queue_ = [device newCommandQueue]; }
-  ~Stream() { [queue_ release]; }
-  id<MTLCommandBuffer> GetCommandBuffer(std::string label = "", bool 
attach_error_callback = true) {
+  ~Stream() {
+    FlushCommandBuffer();
+    [queue_ release];
+  }

Review Comment:
   
![security-high](https://www.gstatic.com/codereviewagent/security-high-priority.svg)
 ![high](https://www.gstatic.com/codereviewagent/high-priority.svg)
   
   A critical Use-After-Free (UAF) vulnerability exists in the `Stream` class 
destructor. The `Stream` destructor flushes the pending command buffer but does 
not wait for its completion. This allows the `addCompletedHandler` callback, 
which captures `this`, to access a deleted `Stream` object if the stream is 
destroyed while work is pending, leading to a process crash. To prevent this 
UAF, the destructor must ensure all pending work is completed before 
destruction. Replacing `FlushCommandBuffer()` with `Synchronize()` will resolve 
this.
   
   ```c
     ~Stream() {
       Synchronize();
       [queue_ release];
     }
   ```



##########
src/runtime/metal/metal_common.h:
##########
@@ -133,8 +247,42 @@ class Stream {
   const std::string& ErrorDescription() const { return error_description_; }
 
  private:
+  /*! \brief Get or create the pending command buffer (shared by compute and 
blit). */
+  id<MTLCommandBuffer> GetOrCreatePendingCommandBuffer() {
+    if (pending_command_buffer_ == nil) {
+      pending_command_buffer_ = [[queue_ commandBuffer] retain];
+      pending_command_buffer_.label = @"TVMBatched";
+      [pending_command_buffer_ addCompletedHandler:^(id<MTLCommandBuffer> 
buffer) {
+        if (buffer.status == MTLCommandBufferStatusError) {
+          TVM_FFI_ICHECK(buffer.error != nil);

Review Comment:
   
![security-medium](https://www.gstatic.com/codereviewagent/security-medium-priority.svg)
 ![medium](https://www.gstatic.com/codereviewagent/medium-priority.svg)
   
   The `completedHandler` uses `TVM_FFI_ICHECK`, which throws an exception if 
the condition fails. Because this handler is executed on a background thread 
owned by the Metal runtime, an unhandled exception will likely cause the entire 
process to terminate abruptly (via `std::terminate`), as there is typically no 
exception handling logic on the stack of these background threads.



##########
src/runtime/metal/metal_common.h:
##########
@@ -201,22 +349,65 @@ class MetalThreadEntry {
   Device device;
   /*! \brief The current stream */
   std::vector<TVMStreamHandle> stream;
-  /*! \brief The shared buffer used for copy. */
+  /*! \brief The shared buffer used for GPU→CPU readback. */
   std::vector<id<MTLBuffer>> temp_buffer_;
+  /*!
+   * \brief Pool of staging buffers for CPU→GPU copies that are inlined
+   * into the pending command buffer. Each inlined copy needs its own
+   * staging buffer because the GPU reads them asynchronously.
+   * Buffers are recycled after FlushCommandBuffer()/Synchronize().
+   */
+  struct StagingBufferPool {
+    struct Entry {
+      id<MTLBuffer> buffer = nil;
+      size_t size = 0;
+    };
+    std::vector<Entry> pool;
+    size_t next_index = 0;  // round-robin within current batch
+
+    id<MTLBuffer> GetOrCreate(id<MTLDevice> dev, size_t nbytes) {
+      if (next_index < pool.size() && pool[next_index].size >= nbytes) {
+        return pool[next_index++].buffer;
+      }
+      // Need a new or bigger buffer at this index
+      if (next_index < pool.size() && pool[next_index].buffer != nil) {
+        [pool[next_index].buffer release];
+      }
+      if (next_index >= pool.size()) {
+        pool.push_back({nil, 0});
+      }
+      pool[next_index].buffer = [dev newBufferWithLength:nbytes 
options:MTLStorageModeShared];
+      pool[next_index].size = nbytes;
+      return pool[next_index++].buffer;

Review Comment:
   
![security-medium](https://www.gstatic.com/codereviewagent/security-medium-priority.svg)
 ![medium](https://www.gstatic.com/codereviewagent/medium-priority.svg)
   
   In `StagingBufferPool::GetOrCreate`, the call to 
`newBufferWithLength:options:` can return `nil` if the GPU memory allocation 
fails (e.g., due to resource exhaustion). The code does not check for `nil` 
before returning the buffer. The caller in `metal_device_api.mm` immediately 
uses the returned buffer in a `memcpy` operation: `memcpy([staging contents], 
...)`. If `staging` is `nil`, `[staging contents]` will return `NULL`, leading 
to a null pointer dereference and a crash.
   
   ```c
         id<MTLBuffer> buf = [dev newBufferWithLength:nbytes 
options:MTLStorageModeShared];
         TVM_FFI_ICHECK(buf != nil) << "Failed to allocate staging buffer of 
size " << nbytes;
         pool[next_index].buffer = buf;
   ```



##########
src/runtime/metal/metal_common.h:
##########
@@ -201,22 +349,65 @@ class MetalThreadEntry {
   Device device;
   /*! \brief The current stream */
   std::vector<TVMStreamHandle> stream;
-  /*! \brief The shared buffer used for copy. */
+  /*! \brief The shared buffer used for GPU→CPU readback. */
   std::vector<id<MTLBuffer>> temp_buffer_;
+  /*!
+   * \brief Pool of staging buffers for CPU→GPU copies that are inlined
+   * into the pending command buffer. Each inlined copy needs its own
+   * staging buffer because the GPU reads them asynchronously.
+   * Buffers are recycled after FlushCommandBuffer()/Synchronize().
+   */
+  struct StagingBufferPool {
+    struct Entry {
+      id<MTLBuffer> buffer = nil;
+      size_t size = 0;
+    };
+    std::vector<Entry> pool;
+    size_t next_index = 0;  // round-robin within current batch

Review Comment:
   ![medium](https://www.gstatic.com/codereviewagent/medium-priority.svg)
   
   The `StagingBufferPool` implementation has a potential for unbounded memory 
growth. The `next_index` is only incremented and never wraps around, and it's 
only reset to 0 in `StreamSync`. If `StreamSync` is not called periodically, 
the `pool` vector will grow indefinitely with each CPU->GPU copy, leading to a 
memory leak.
   
   Additionally, the comment on line 366 `// round-robin within current batch` 
is misleading, as the allocation is linear, not round-robin.
   
   A more robust implementation might use a fixed-size ring buffer and a 
mechanism (like fences or command buffer completion handlers) to recycle 
buffers, or block if no buffers are available.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [Metal] Batched command dispatch and staging buffer pool [tvm]

Reply via email to