andygrove opened a new pull request, #4582:
URL: https://github.com/apache/datafusion-comet/pull/4582
## Which issue does this PR close?
Relates to #4576. This is an exploratory prototype of the RSS circuit
breaker ("OomGuard") half of that issue, not a complete implementation, so it
does not close it.
## Rationale for this change
Comet's memory accounting relies on voluntary `MemoryPool` reservations,
which miss allocations made by Arrow buffers, join scratch space, and
expression kernels. When real native memory exceeds the container limit, the
OS/YARN/Kubernetes kills the entire executor JVM, losing every task and all
cached data on it.
This prototype adds an executor-global, allocator-level circuit breaker. It
tracks the real bytes the global allocator hands out and, when an armed,
over-budget condition is detected on a query-worker thread, fails that single
task with a retriable `ResourcesExhausted` error instead of letting the
executor get OOM-killed. The approach adapts the `AccountingAllocator` from
apache/datafusion#22626 (the byte-tracking allocator wrapper) rather than
depending on it, since that code lives in DataFusion's test-only `sqllogictest`
crate.
## What changes are included in this PR?
Gated behind a new `oom-guard` cargo feature; the default build is unchanged
with zero added per-allocation overhead.
- `native/core/src/execution/memory_pools/oom_guard.rs` (new):
`AccountingAllocator<A>` wrapping the inner global allocator; a single
process-wide balance with per-thread drift settled at a 64 KiB threshold;
`arm`/`disarm`/`stamp_current_thread`/`current_balance`; a typed
`OomGuardPanic`, raised via `panic_any` on an armed, stamped thread that
crosses the limit, with reentrancy protection so the panic's own boxing
allocation does not recurse.
- `native/core/src/lib.rs`: under the `oom-guard` feature, installs the
wrapper as `#[global_allocator]` over jemalloc / mimalloc / system; mutually
exclusive cfgs leave the default build untouched.
- `native/core/src/execution/jni_api.rs`: stamps tokio worker threads
(`on_thread_start`) and the JNI caller thread; arms the guard from config in
`createPlan`; maps `OomGuardPanic` to `DataFusionError::ResourcesExhausted` at
both execution boundaries (the spawned/channel path, both producer and
consumer, and the busy-poll `block_on` path).
- `spark/src/main/scala/org/apache/comet/CometConf.scala`: registers
`spark.comet.exec.memoryGuard.enabled` (default false) and
`spark.comet.exec.memoryGuard.size` (optional; defaults to the executor
off-heap size).
Known limitations / out of scope for this prototype (candidates for
follow-ups):
- Executor-global granularity only; no per-task attribution or fairness.
- Only tokio workers and the JNI caller thread are stamped, so allocations
on `spawn_blocking`/IO/other pools are tracked but cannot themselves trip the
breaker.
- Layout-byte accounting only; no periodic resync to real jemalloc/mimalloc
resident stats.
- Does not feed the live balance back into DataFusion's `MemoryPool` (the
"online accounting" half of #4576).
- Process-global, last-writer-wins limit, armed per `createPlan`.
Note: commits carry `[skip ci]` intentionally while this is an early
prototype.
## How are these changes tested?
- Rust unit tests in `oom_guard.rs` cover the decision/settle helpers, and
that the breaker trips only on an armed, stamped thread (disarmed never trips,
unstamped never trips).
- An end-to-end Rust test drives a real 64 MiB heap allocation through the
installed `AccountingAllocator` and asserts an `OomGuardPanic` is raised and
caught.
- Verified the build and `clippy -D warnings` across the default,
`oom-guard`, and `jemalloc,oom-guard` feature combinations.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]