avantgardnerio opened a new pull request, #22834:
URL: https://github.com/apache/datafusion/pull/22834

   ## Which issue does this PR close?
   
   Not closing an issue directly. This is a follow-up to the
   \`MemoryPool\` accounting discussion (#22723, #22758) — demonstrates a
   unilateral, downstream-shippable approach that doesn't require changing
   \`MemoryPool\`'s convention.
   
   ## Rationale for this change
   
   \`MemoryPool\` is voluntary. Operators that don't call \`try_grow\` —
   in-flight batches, \`arrow-row\` decode buffers, Arrow kernel scratch
   space, allocator slab overhead — allocate freely past the declared
   budget. On a process with a hard memory ceiling (k8s pod cgroup,
   ulimit), the result is an OS-level kill that takes down every
   concurrent query, not just the offender.
   
   This PR adds \`OomGuard\`, a \`GlobalAlloc\` wrapper that catches that
   case at the allocator layer:
   
   1. Every Rust allocation is debited from a global \`BALANCE\` atomic.
   2. A dedicated OS thread polls jemalloc's \`stats.resident\` every 10ms
      and syncs \`BALANCE = threshold - resident\`, catching slab pages
      that the kernel sees but \`GlobalAlloc\` doesn't.
   3. When \`BALANCE\` goes negative on a stamped query worker, the
      wrapper returns \`NULL\` from \`alloc\`.
   4. An \`alloc_error_hook\` originates \`panic_any(OomGuardPanic)\` from
      a safe-Rust frame outside the allocator.
   5. The panic unwinds through \`df.collect()\`, gets caught in the
      example, and is converted to \`Status::resource_exhausted\`. The
      query dies; the server survives.
   
   A/B demo on the same 1GiB cgroup:
   - \`OOM_GUARD=off\`: client sees broken pipe, server SIGKILLed.
   - \`OOM_GUARD\` on (default): client sees \`ResourceExhausted\`,
     follow-up \`SELECT 1\` against the same server returns immediately.
   
   ## What changes are included in this PR?
   
   Everything is scoped to \`datafusion-examples/examples/flight/\`:
   
   - \`oom_guard.rs\` (new) — the \`GlobalAlloc\` wrapper, per-thread
     drift settling, jemalloc-stats-based poll on a \`std::thread\`,
     cgroup-v2/v1 memory-limit discovery.
   - \`sql_server.rs\` — wraps the global allocator with
     \`OomGuard<Jemalloc>\`, arms via \`OOM_GUARD\` env var with
     configurable headroom (\`OOM_GUARD_HEADROOM\`, default 5%), wraps
     \`df.collect()\` in \`catch_unwind\` to convert \`OomGuardPanic\` to
     a \`Status::resource_exhausted\` response.
   - \`main.rs\` — installs the \`alloc_error_hook\`, installs a global
     panic hook that logs \`OomGuardPanic\` distinctly (no backtrace
     capture: it allocates during symbol resolution at exactly the moment
     we're at the memory cliff).
   - \`Cargo.toml\` — drops \`mimalloc\`, adds \`tikv-jemallocator\` and
     \`tikv-jemalloc-ctl\`.
   - \`.cargo/config.toml\` (new) — sets \`RUSTC_BOOTSTRAP=1\` so
     \`#![feature(alloc_error_hook)]\` (tracking rust-lang/rust#51245)
     builds on stable.
   - \`README.md\` (new) — A/B walkthrough.
   
   ## Are these changes tested?
   
   Manually via the A/B in the README on a Linux dev box with
   \`systemd-run --user --scope -p MemoryMax=1G\`. No automated tests
   added: the example's behavior depends on a cgroup limit being in
   effect, which doesn't fit the existing test harness.
   
   ## Are there any user-facing changes?
   
   Only to the \`flight\` example. No core / library API surface changed.
   
   ## Caveats
   
   - \`alloc_error_hook\` is currently unstable. The workspace
     \`.cargo/config.toml\` sets \`RUSTC_BOOTSTRAP=1\` to compile it on a
     stable toolchain — same workaround that downstream forks
     (\`datafusion-comet\`, others) use.
   - The poll currently reads jemalloc-specific stats. Making it
     allocator-agnostic by reading cgroup's \`memory.current\` instead is
     a straightforward follow-up if there's interest.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to