avantgardnerio opened a new pull request, #22834:
URL: https://github.com/apache/datafusion/pull/22834
## Which issue does this PR close?
Not closing an issue directly. This is a follow-up to the
\`MemoryPool\` accounting discussion (#22723, #22758) — demonstrates a
unilateral, downstream-shippable approach that doesn't require changing
\`MemoryPool\`'s convention.
## Rationale for this change
\`MemoryPool\` is voluntary. Operators that don't call \`try_grow\` —
in-flight batches, \`arrow-row\` decode buffers, Arrow kernel scratch
space, allocator slab overhead — allocate freely past the declared
budget. On a process with a hard memory ceiling (k8s pod cgroup,
ulimit), the result is an OS-level kill that takes down every
concurrent query, not just the offender.
This PR adds \`OomGuard\`, a \`GlobalAlloc\` wrapper that catches that
case at the allocator layer:
1. Every Rust allocation is debited from a global \`BALANCE\` atomic.
2. A dedicated OS thread polls jemalloc's \`stats.resident\` every 10ms
and syncs \`BALANCE = threshold - resident\`, catching slab pages
that the kernel sees but \`GlobalAlloc\` doesn't.
3. When \`BALANCE\` goes negative on a stamped query worker, the
wrapper returns \`NULL\` from \`alloc\`.
4. An \`alloc_error_hook\` originates \`panic_any(OomGuardPanic)\` from
a safe-Rust frame outside the allocator.
5. The panic unwinds through \`df.collect()\`, gets caught in the
example, and is converted to \`Status::resource_exhausted\`. The
query dies; the server survives.
A/B demo on the same 1GiB cgroup:
- \`OOM_GUARD=off\`: client sees broken pipe, server SIGKILLed.
- \`OOM_GUARD\` on (default): client sees \`ResourceExhausted\`,
follow-up \`SELECT 1\` against the same server returns immediately.
## What changes are included in this PR?
Everything is scoped to \`datafusion-examples/examples/flight/\`:
- \`oom_guard.rs\` (new) — the \`GlobalAlloc\` wrapper, per-thread
drift settling, jemalloc-stats-based poll on a \`std::thread\`,
cgroup-v2/v1 memory-limit discovery.
- \`sql_server.rs\` — wraps the global allocator with
\`OomGuard<Jemalloc>\`, arms via \`OOM_GUARD\` env var with
configurable headroom (\`OOM_GUARD_HEADROOM\`, default 5%), wraps
\`df.collect()\` in \`catch_unwind\` to convert \`OomGuardPanic\` to
a \`Status::resource_exhausted\` response.
- \`main.rs\` — installs the \`alloc_error_hook\`, installs a global
panic hook that logs \`OomGuardPanic\` distinctly (no backtrace
capture: it allocates during symbol resolution at exactly the moment
we're at the memory cliff).
- \`Cargo.toml\` — drops \`mimalloc\`, adds \`tikv-jemallocator\` and
\`tikv-jemalloc-ctl\`.
- \`.cargo/config.toml\` (new) — sets \`RUSTC_BOOTSTRAP=1\` so
\`#![feature(alloc_error_hook)]\` (tracking rust-lang/rust#51245)
builds on stable.
- \`README.md\` (new) — A/B walkthrough.
## Are these changes tested?
Manually via the A/B in the README on a Linux dev box with
\`systemd-run --user --scope -p MemoryMax=1G\`. No automated tests
added: the example's behavior depends on a cgroup limit being in
effect, which doesn't fit the existing test harness.
## Are there any user-facing changes?
Only to the \`flight\` example. No core / library API surface changed.
## Caveats
- \`alloc_error_hook\` is currently unstable. The workspace
\`.cargo/config.toml\` sets \`RUSTC_BOOTSTRAP=1\` to compile it on a
stable toolchain — same workaround that downstream forks
(\`datafusion-comet\`, others) use.
- The poll currently reads jemalloc-specific stats. Making it
allocator-agnostic by reading cgroup's \`memory.current\` instead is
a straightforward follow-up if there's interest.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]