sezruby commented on PR #12226:
URL: https://github.com/apache/gluten/pull/12226#issuecomment-4634388224

   > @sezruby do you use lance-spark? Do you know how it manages the memory?
   
   Yes — the [lance-spark](https://github.com/lance-format/lance-spark) 
connector. The relevant entry point is `LanceDataWriter`:
   
   ```java
   try (ArrowArrayStream arrowStream =
           ArrowArrayStream.allocateNew(LanceRuntime.allocator())) {
     Data.exportArrayStream(LanceRuntime.allocator(), bufferRef, arrowStream);
     return Fragment.create(writeOptions.getDatasetUri(), arrowStream, params);
   }
   ```
   
   **lance-spark memory model:**
   
   - **One process-wide JVM `BufferAllocator`** — `LanceRuntime.allocator()` is 
a lazily-initialized `RootAllocator` with size from env `LANCE_ALLOCATOR_SIZE` 
(default `Long.MAX_VALUE`). Global singleton, not per-task.
   - **Spark `TaskMemoryManager` is not involved.** Allocations don't go 
through `acquireExecutionMemory(...)`, so Spark's spill/eviction can't react. 
Per-batch footprint is bounded by the `maxBatchBytes` write option (default 
~64MB), and `try-with-resources` releases boundary buffers as soon as 
`Fragment.create(...)` returns.
   - **JVM ↔ native handoff** is Arrow C-Data only. Vectors backing the `VectorSchemaRoot` are exported into the `ArrowArrayStream` struct as raw pointers plus a release callback. The Rust side only borrows the buffers; the JVM keeps ownership and frees them when the release callback fires, per Arrow's standard release contract. No double-ownership.
   - **Lifecycle**: root allocator lives for the JVM process; child allocators 
per batch close on stream close.
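
   To make the lifecycle above concrete, here is a minimal sketch in plain Java. It is a toy model, not the Arrow or lance-spark API: `ToyAllocator` and `ToyRuntime` are hypothetical names, and unlike a real Arrow allocator the toy `close()` silently returns outstanding bytes instead of reporting a leak. It only illustrates the shape described: a lazily-initialized process-wide root sized from `LANCE_ALLOCATOR_SIZE`, and a per-batch child released by `try-with-resources`.

   ```java
   /** Toy stand-in for an Arrow-style allocator (hypothetical, single-threaded). */
   final class ToyAllocator implements AutoCloseable {
     private final ToyAllocator parent; // null for the root
     private final long limit;          // max bytes this allocator may hold
     private long used;

     ToyAllocator(ToyAllocator parent, long limit) {
       this.parent = parent;
       this.limit = limit;
     }

     ToyAllocator newChild(long childLimit) {
       return new ToyAllocator(this, childLimit);
     }

     /** Accounts for bytes here and in every ancestor; fails if a limit is exceeded. */
     void allocate(long bytes) {
       if (bytes > limit - used) {
         throw new IllegalStateException("allocation exceeds limit");
       }
       if (parent != null) {
         parent.allocate(bytes);
       }
       used += bytes;
     }

     long used() {
       return used;
     }

     /** Simplification: hands any outstanding bytes back to the ancestors. */
     @Override
     public void close() {
       for (ToyAllocator a = parent; a != null; a = a.parent) {
         a.used -= used;
       }
       used = 0;
     }
   }

   /** Hypothetical analogue of a process-wide runtime holding one root allocator. */
   final class ToyRuntime {
     private ToyRuntime() {}

     // Initialization-on-demand holder: the root is created on first use.
     private static final class Holder {
       static final ToyAllocator ROOT = new ToyAllocator(null, limitFromEnv());
     }

     static ToyAllocator allocator() {
       return Holder.ROOT;
     }

     static long limitFromEnv() {
       String v = System.getenv("LANCE_ALLOCATOR_SIZE");
       return v == null ? Long.MAX_VALUE : Long.parseLong(v);
     }
   }
   ```

   Note what is absent from the sketch: nothing in it talks to a task-level memory manager, which is the point of the second bullet above.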
   
   **How that compares to gluten/velox:**
   
   | Aspect | lance-spark | gluten/velox |
   |---|---|---|
   | JVM allocator | Process-wide `RootAllocator` singleton | Per-task `BufferAllocator` from `ArrowBufferAllocators.contextInstance()` |
   | Spark `MemoryManager` integration | None | Yes: `ArrowReservationListener` ↔ `acquireExecutionMemory(...)` |
   | Native allocator | Rust crate's process heap (not coordinated with JVM) | Velox `MemoryPool` hierarchy, JNI-bridged back to a JVM `ReservationListener` |
   | Spill | None (OOM = JVM dies) | Native spill to disk, Spark-governed |
   | Memory accounting | Container RSS only | Spark UI metrics + Velox pool stats |
   | Off-heap visibility | Invisible to Spark | Threaded through Spark's manager |
   
   The difference is intent: lance-spark uses Arrow as a **boundary ABI** 
("write a batch, free it") and never integrates with Spark's memory manager 
because it doesn't need to — it's a connector, not an execution engine. 
gluten/velox is a **full alternative execution engine** that has to play nicely 
with Spark's spill/OOM machinery, so it threads a `ReservationListener` through 
Arrow's `BufferAllocator` parent chain *and* across JNI into Velox's 
`MemoryPool`.
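
   The listener idea can be sketched in a few lines of plain Java. This is a toy model under stated assumptions, not Gluten's, Arrow's, or Spark's actual interfaces: `ToyReservationListener`, `ToySpiller`, `ToyTaskMemoryPool`, and `ListeningAllocator` are all hypothetical names. It shows only the control flow: every allocation first asks a per-task pool for a reservation, and the pool can ask a consumer to spill before it gives up.

   ```java
   /** Hypothetical callback invoked around every allocation and free. */
   interface ToyReservationListener {
     void reserve(long bytes);
     void unreserve(long bytes);
   }

   /** Hypothetical hook the pool calls when it is short on memory. */
   interface ToySpiller {
     long spill(long bytesNeeded); // returns bytes actually released
   }

   /** Toy stand-in for a per-task execution-memory budget. */
   final class ToyTaskMemoryPool implements ToyReservationListener {
     private final long budget;
     private long granted;
     private ToySpiller spiller = bytesNeeded -> 0L; // default: nothing to spill

     ToyTaskMemoryPool(long budget) {
       this.budget = budget;
     }

     void setSpiller(ToySpiller spiller) {
       this.spiller = spiller;
     }

     long granted() {
       return granted;
     }

     @Override
     public void reserve(long bytes) {
       long shortfall = granted + bytes - budget;
       if (shortfall > 0) {
         // Spilling frees buffers, which calls unreserve() and lowers `granted`.
         spiller.spill(shortfall);
       }
       if (granted + bytes > budget) {
         throw new IllegalStateException("task memory exhausted");
       }
       granted += bytes;
     }

     @Override
     public void unreserve(long bytes) {
       granted -= bytes;
     }
   }

   /** Toy allocator that reports every allocation to its listener. */
   final class ListeningAllocator {
     private final ToyReservationListener listener;
     private long used;

     ListeningAllocator(ToyReservationListener listener) {
       this.listener = listener;
     }

     void allocate(long bytes) {
       listener.reserve(bytes); // may trigger a spill or throw
       used += bytes;
     }

     void free(long bytes) {
       used -= bytes;
       listener.unreserve(bytes);
     }

     long used() {
       return used;
     }
   }
   ```

   In this model an over-budget request becomes a spill request rather than an immediate failure, which is the behaviour the connector-style allocator has no way to provide.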
   
   > Actually we have a long term plan to introduce datafusion backend as 
complementary to Velox. Lance may be a good try.
   
   For that direction, lance-spark's allocator model is too lightweight to fill a Velox-equivalent role as it stands. You'd want the Spark memory-manager plumbing on the DataFusion side (similar to how gluten wires Velox today) before it could serve as a full execution backend. As-is, lance-spark works well as a Lance dataset reader/writer connector, but it isn't an engine. Worth keeping in mind when scoping the DataFusion backend.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

