avantgardnerio opened a new issue, #22723:
URL: https://github.com/apache/datafusion/issues/22723

   ### Describe the bug
   
   `NestedLoopJoinExec`'s memory-limited (spill) path allocates more memory 
than it reserves through the `MemoryPool`. The accounting-pool framework added 
in #22626 surfaces this if `HEADROOM_FACTOR` in the SLT pool is tightened — see 
#22721, which lowers it from 8.0 → 5.0 and trips this test.
   
   ### To Reproduce
   
   In `datafusion/sqllogictest/src/accounting_pool.rs`, set:
   
   \`\`\`rust
   const HEADROOM_FACTOR: f64 = 5.0;
   \`\`\`
   
   Then run `nested_loop_join_spill.slt`. First query at line 33:
   
   \`\`\`sql
   SET datafusion.execution.target_partitions = 1;
   SET datafusion.runtime.memory_limit = '150K';
   
   SELECT count(*) as cnt, min(v1) as mn, max(v1) as mx
   FROM generate_series(1, 100000) AS t1(v1)
   INNER JOIN generate_series(1, 1) AS t2(v2)
     ON (t1.v1 + t2.v2) > 0;
   \`\`\`
   
   ### Expected behavior
   
   Query succeeds with allocator usage bounded by the configured `memory_limit` 
(×~10% slop for legitimate untracked overhead like Tokio/Rayon thread state).
   
   ### Actual behavior
   
   \`\`\`
   External error: 1 errors in file 
datafusion/sqllogictest/test_files/nested_loop_join_spill.slt
   1. query failed: Other Error: allocator overdraft: account balance at panic 
= -20245 bytes
      at nested_loop_join_spill.slt:33
   \`\`\`
   
   Peak allocator usage reaches ~770KB against a declared 150KB pool — 5.13× 
the budget. The first ~750KB (5×) is absorbed by the headroom factor; the 
additional 20245 bytes is allocator usage that was never reserved through 
`MemoryPool::try_grow`.
   
   ### Likely root cause
   
   Spill path in `NestedLoopJoinExec` has at least these untracked allocation 
sites (need verification with targeted instrumentation):
   
   - `generate_next_batch` buffering — `RecordBatch`es accumulated for the 
build side before the spill decision
   - `concat_batches` at the spill boundary — copies into a coalesced batch 
before the IPC writer sees it
   - `take_native` during the probe phase — gather kernels allocate output 
buffers
   - IPC reader path on spill re-read — the `StreamReader`/`FileReader` decoder 
owns buffers that aren't accounted
   
   The first query in the test stays small (100K × i32 = ~400KB build side), so 
the overshoot is whatever lives off-pool in the spill setup, not the bulk data 
itself.
   
   ### Additional context
   
   - Surfaced by the allocator-vs-pool tracker added in #22626.
   - Demonstration PR: #22721 (test-only: lowers `HEADROOM_FACTOR`; expected to 
fail until this is fixed).
   - Fix should reserve memory through `MemoryReservation` before each of the 
sites above, not bump the headroom constant — the constant is a leak detector, 
not a budget.
   
   ### Component
   
   - [x] Physical Plan


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to