andygrove opened a new pull request, #4582:
URL: https://github.com/apache/datafusion-comet/pull/4582

   ## Which issue does this PR close?
   
   Relates to #4576. This is an exploratory prototype of the RSS circuit 
breaker ("OomGuard") half of that issue, not a complete implementation, so it 
does not close it.
   
   ## Rationale for this change
   
   Comet's memory accounting relies on voluntary `MemoryPool` reservations, 
which miss allocations made by Arrow buffers, join scratch space, and 
expression kernels. When real native memory exceeds the container limit, the 
OS/YARN/Kubernetes kills the entire executor JVM, losing every task and all 
cached data on it.
   
   This prototype adds an executor-global, allocator-level circuit breaker. It 
tracks the real bytes the global allocator hands out and, when an armed, 
over-budget condition is detected on a query-worker thread, fails that single 
task with a retriable `ResourcesExhausted` error instead of letting the 
executor get OOM-killed. The approach adapts the `AccountingAllocator` from 
apache/datafusion#22626 (the byte-tracking allocator wrapper) rather than 
depending on it, since that code lives in DataFusion's test-only `sqllogictest` 
crate.
   
   ## What changes are included in this PR?
   
   Gated behind a new `oom-guard` cargo feature; the default build is unchanged 
with zero added per-allocation overhead.
   
   - `native/core/src/execution/memory_pools/oom_guard.rs` (new): 
`AccountingAllocator<A>` wrapping the inner global allocator; a single 
process-wide balance with per-thread drift settled at a 64 KiB threshold; 
`arm`/`disarm`/`stamp_current_thread`/`current_balance`; a typed 
`OomGuardPanic`, raised via `panic_any` on an armed, stamped thread that 
crosses the limit, with reentrancy protection so the panic's own boxing 
allocation does not recurse.
   - `native/core/src/lib.rs`: under the `oom-guard` feature, installs the 
wrapper as `#[global_allocator]` over jemalloc / mimalloc / system; mutually 
exclusive cfgs leave the default build untouched.
   - `native/core/src/execution/jni_api.rs`: stamps tokio worker threads 
(`on_thread_start`) and the JNI caller thread; arms the guard from config in 
`createPlan`; maps `OomGuardPanic` to `DataFusionError::ResourcesExhausted` at 
both execution boundaries (the spawned/channel path, both producer and 
consumer, and the busy-poll `block_on` path).
   - `spark/src/main/scala/org/apache/comet/CometConf.scala`: registers 
`spark.comet.exec.memoryGuard.enabled` (default false) and 
`spark.comet.exec.memoryGuard.size` (optional; defaults to the executor 
off-heap size).
   
   Known limitations / out of scope for this prototype (candidates for 
follow-ups):
   - Executor-global granularity only; no per-task attribution or fairness.
   - Only tokio workers and the JNI caller thread are stamped, so allocations 
on `spawn_blocking`/IO/other pools are tracked but cannot themselves trip the 
breaker.
   - Layout-byte accounting only; no periodic resync to real jemalloc/mimalloc 
resident stats.
   - Does not feed the live balance back into DataFusion's `MemoryPool` (the 
"online accounting" half of #4576).
   - Process-global, last-writer-wins limit, armed per `createPlan`.
   
   Note: commits carry `[skip ci]` intentionally while this is an early 
prototype.
   
   ## How are these changes tested?
   
   - Rust unit tests in `oom_guard.rs` cover the decision/settle helpers, and 
that the breaker trips only on an armed, stamped thread (disarmed never trips, 
unstamped never trips).
   - An end-to-end Rust test drives a real 64 MiB heap allocation through the 
installed `AccountingAllocator` and asserts an `OomGuardPanic` is raised and 
caught.
   - Verified the build and `clippy -D warnings` across the default, 
`oom-guard`, and `jemalloc,oom-guard` feature combinations.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to