apstenku123 opened a new pull request, #19504:
URL: https://github.com/apache/tvm/pull/19504

   ## Summary
   
   Adds an opt-in environment variable `TVM_METAL_STORAGE_MODE` that lets users 
allocate device data buffers as `MTLResourceStorageModeShared` (or `Managed`) 
instead of the default `MTLResourceStorageModePrivate`. Default behaviour is 
unchanged.
   
   | value             | mode                                   | semantics                                            |
   | ----------------- | -------------------------------------- | ---------------------------------------------------- |
   | unset / `private` | `MTLResourceStorageModePrivate`        | default, GPU-only, preserves historical behaviour    |
   | `shared`          | `MTLResourceStorageModeShared`         | CPU+GPU mapped; required for zero-copy DLPack to MLX |
   | `managed`         | `MTLResourceStorageModeManaged`        | macOS-only intermediate (driver tracks dirty pages)  |
   | anything else     | `MTLResourceStorageModePrivate` + warn | safe fall-back                                       |
   
   The env var is read once, on the first call to
`MetalWorkspace::AllocDataSpace`, and cached for the lifetime of the process,
so there is no per-allocation overhead. A new FFI helper
`metal.GetStorageMode` is registered alongside the existing
`metal.GetProfileCounters` / `metal.ResetProfileCounters` helpers so that tests
can verify the resolved mode without an ObjC bridge.
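
The resolution rules in the table above, plus the read-once caching, can be modelled in a few lines of Python. This is only an illustrative sketch: the actual logic lives in Objective-C++ in `metal_device_api.mm`, the function names `resolve_storage_mode` and `cached_storage_mode` are hypothetical, and case-insensitive matching is an assumption based on the "mixed-case Shared" case in the test plan.

```python
import functools
import os
import warnings

# Names mirror the Metal constants in the table; the mapping itself is a
# model of the documented behaviour, not the code from the patch.
_MODES = {
    "private": "MTLResourceStorageModePrivate",
    "shared": "MTLResourceStorageModeShared",
    "managed": "MTLResourceStorageModeManaged",
}


def resolve_storage_mode(value):
    """Map a raw TVM_METAL_STORAGE_MODE value to a storage-mode name."""
    if value is None:  # variable unset: historical default
        return _MODES["private"]
    # Assumption: matching is case-insensitive ("Shared" acts like "shared").
    mode = _MODES.get(value.lower())
    if mode is None:
        # "anything else" row: warn and fall back to Private.
        warnings.warn(f"unknown TVM_METAL_STORAGE_MODE={value!r}; using private")
        return _MODES["private"]
    return mode


@functools.lru_cache(maxsize=None)
def cached_storage_mode():
    """Read the environment once; later calls reuse the first result."""
    return resolve_storage_mode(os.environ.get("TVM_METAL_STORAGE_MODE"))
```

Because the result is cached, changing the variable after the first call has no effect in this model, which is why any test of the real behaviour needs a fresh process per value.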
   
   The staging-buffer pool (`metal_common.h:383`) and temp-buffer pool 
(`metal_device_api.mm:374`) already use `MTLStorageModeShared` and are 
intentionally untouched — they're host-staging by design and don't fall under 
the data-space allocator.
   
   ## Why
   
   TVM's Metal device API has always allocated `MTLBuffer` with
`MTLResourceStorageModePrivate`. This is the right choice for pure-GPU
workloads (no CPU page mapping), but it blocks zero-copy DLPack interop with
other Metal-using frameworks that allocate Shared/Managed buffers, notably
`ml-explore/mlx`, which uses `MTLResourceStorageModeShared` everywhere. The two
allocators on the same `MTLDevice` produce buffers with different page-mapping
semantics, so DLPack capsules from TVM cannot be consumed by `mx.array`
(live-tested: `mx.array(tvm_metal_capsule)` fails with `std::bad_cast`).
   
   This change unblocks the bridge from TVM NDArray to `mx.array`: both wrap
an `MTLBuffer`, and their storage modes must match for a foreign capsule to be
consumable. It is the producer half of a pair; the consumer half is a parallel
ml-explore/mlx PR that adds `mx.from_dlpack(obj)`.
   
   ## Test plan
   
   - [ ] `xcrun --sdk macosx clang++ -std=c++17 -framework Metal 
syntax_check.mm -o syntax_check && ./syntax_check` — exercises env-var parsing 
for all 6 cases (unset, shared, mixed-case Shared, invalid, managed, private).
   - [ ] Build runtime: `mkdir build && cd build && cmake -DUSE_METAL=ON 
-DUSE_LLVM=ON -DCMAKE_BUILD_TYPE=Release .. && make -j tvm_runtime`
   - [ ] `./runtime_check` (TVM-linked probe): validates that the env var
flows through to a real `MTLBuffer.storageMode`. Captured live on 2026-05-03 on
an Apple M4 Max for unset/shared/managed/private.
   - [ ] `TVM_METAL_STORAGE_MODE=shared python -c "import tvm; arr = 
tvm.nd.empty((4,), dtype='float32', device=tvm.metal()); print(arr.shape)"`
   - [ ] CI: macos-arm64 runner in apache/tvm should exercise the existing 
Metal tests; default behaviour (env unset) is unchanged.
   
   ## Caveats / non-goals
   
   - This is a **copy-elision interop patch**, not a kernel-speed patch. 
Default Private mode remains the right choice for TVM-only workloads.
   - The patch artifact only changes `src/runtime/metal/metal_device_api.mm`; 
it does not yet add an upstream `tests/python/runtime/...` file. A 
subprocess-isolated Python test for the env-cache behaviour can be folded in if 
maintainers want it in tree.
   - Local Metal microbenchmarks on an Apple M4 Max show that Shared buffers
remove the staging-buffer + blit/wait cost at CPU↔Metal transfer boundaries
(e.g., 1 MiB CPU→Metal median 138.375 µs Private vs 12.750 µs Shared in a
downstream probe, roughly 10.9x lower). These numbers are local sanity checks,
not in-tree benchmarks.
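
The subprocess-isolated Python test mentioned in the caveats could look roughly like the sketch below. It is not part of this patch. The helper name `run_probe` is hypothetical, the `tvm.nd.empty` call is taken from the test plan, and the return type of `metal.GetStorageMode` is an assumption, so the sketch only prints it rather than asserting a specific value.

```python
import os
import subprocess
import sys

# Script run in the child process. The allocation forces the first
# AllocDataSpace call; the helper's return value is printed as-is because
# its exact type is an assumption here.
PROBE = """
import tvm
tvm.nd.empty((4,), dtype="float32", device=tvm.metal())
print(tvm.get_global_func("metal.GetStorageMode")())
"""


def run_probe(mode, script=PROBE):
    """Run `script` in a fresh interpreter with TVM_METAL_STORAGE_MODE=mode.

    A new process per case is needed because the resolved mode is cached
    for the lifetime of the process. Pass mode=None to leave it unset.
    """
    env = dict(os.environ)
    env.pop("TVM_METAL_STORAGE_MODE", None)
    if mode is not None:
        env["TVM_METAL_STORAGE_MODE"] = mode
    result = subprocess.run(
        [sys.executable, "-c", script],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

On a Metal-enabled build, calling `run_probe(m)` for each of `None`, `"private"`, `"shared"`, `"managed"` and an invalid value would cover the rows of the mode table, with one process per row.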
   
   ## Pairing
   
   Paired upstream patch: a parallel ml-explore/mlx PR adds a Metal-aware
`mx.from_dlpack(obj)` consumer. Both patches must land for the zero-copy
MLX↔TVM use case to work end-to-end.
   
   ## Attribution
   
   Co-developed with `cppmega.mlx` for Apple-Silicon Metal interop with MLX.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

