Re: [I] NestedLoopJoinExec spill path: untracked allocation overshoots memory pool [datafusion]

via GitHub Thu, 04 Jun 2026 05:00:58 -0700


avantgardnerio commented on issue #22723:
URL: https://github.com/apache/datafusion/issues/22723#issuecomment-4621928547


   Extreme Programming breaks everything down into:
   1. values: "we have limited resources, so we should manage them wisely"
   2. principles: "so we use X test suite to find untracked memory"
   3. practices: "so we can fix OOM bug #22739"
   
   IME, people almost always agree on 1 & 3. The principles are usually what 
takes time to come to agreement on, so it's good and natural to be discussing 
what the optimal way to do this is.
   
   So to avoid getting too abstract, I'll raise a specific grounded question 
about not doing SLTs:
   
   When our customer runs a query like:
   
   ```
   source logs(<team>)
     | filter <priority predicate>
           && \$d.message != null
     | groupby \$d.message agg count(1) as \$d.occurrences
     | orderby \$d.occurrences desc
     | limit 15
   ```
   
   then, if we have memory-based SLTs in place, we can translate that into:
   
   ```
   statement ok
   CREATE TABLE utf8_keys AS
   SELECT cast(v AS varchar) || repeat('x', 200) AS k,
          make_array(1) AS _force_rows
   FROM generate_series(1, 50000) AS t(v)
   
   statement ok
   SET datafusion.runtime.memory_limit = '1M'
   
   query I nosort
   SELECT count(*) FROM (
     SELECT k, _force_rows, count(*) AS c
     FROM utf8_keys
     GROUP BY k, _force_rows
   )
   ----
   50000
   ```
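
For a rough sense of scale, the group keys in this test come to roughly ten times the configured limit, so the aggregation presumably cannot finish in memory and has to spill. A back-of-the-envelope sketch in Python, assuming `'1M'` is read as 1 MiB (1,048,576 bytes) and counting only raw key bytes (no Arrow offsets, hash-table overhead, or `_force_rows` arrays); `key_bytes` is a hypothetical helper, not part of DataFusion:

```python
# Rough estimate of the raw bytes held by the 50,000 group keys in utf8_keys.
# Assumptions (not from the thread): '1M' means 1 MiB, and only key bytes are
# counted; offsets, hash tables, and the _force_rows arrays are ignored.

def key_bytes(n_rows: int = 50_000, padding: int = 200) -> int:
    """Total bytes of cast(v AS varchar) || repeat('x', padding) over all rows."""
    return sum(len(str(v)) + padding for v in range(1, n_rows + 1))


MEMORY_LIMIT = 1 * 1024 * 1024  # assumed meaning of '1M'

total = key_bytes()
ratio = total / MEMORY_LIMIT
print(f"key bytes: {total:,}  limit: {MEMORY_LIMIT:,}  ratio: {ratio:.1f}x")
```

The real footprint is larger than this estimate, which is the point of the test: any buffer on the spill path that is not reserved against the pool has plenty of room to overshoot it.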
   
   And see an error like:
   
   ```
   1. query failed: Other Error: allocator overdraft: account balance at panic = -1384887 bytes
   [SQL] SELECT count(*) FROM (
     SELECT k, _force_rows, count(*) AS c
     FROM utf8_keys
     GROUP BY k, _force_rows
   )
   at /__w/datafusion/datafusion/datafusion/sqllogictest/test_files/group_by_spill_row_decode.slt:49
   ```
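
The "overdraft" wording suggests a ledger-style check: bytes reserved from the memory pool are credits, bytes actually allocated are debits, and a negative balance means some allocation was never reserved. A minimal sketch of that idea in Python; the `Account` class and its methods are hypothetical illustrations, not DataFusion's or the test harness's actual implementation:

```python
class Account:
    """Hypothetical ledger: reservations credit it, allocations debit it."""

    def __init__(self) -> None:
        self.balance = 0  # reserved bytes minus allocated bytes

    def reserve(self, nbytes: int) -> None:
        # The operator asked the memory pool for nbytes before allocating.
        self.balance += nbytes

    def allocate(self, nbytes: int) -> None:
        # The operator actually allocated nbytes. An untracked allocation is
        # one that reaches this point without a matching reserve() call.
        self.balance -= nbytes
        if self.balance < 0:
            raise MemoryError(
                f"allocator overdraft: account balance = {self.balance} bytes"
            )


acct = Account()
acct.reserve(1_000)
acct.allocate(800)      # covered by the reservation
try:
    acct.allocate(500)  # only 200 bytes remain reserved, so this overdraws
except MemoryError as err:
    print(err)
```

Under this model, the size of the negative balance is a direct measure of how many bytes went untracked, which is what makes the failure actionable in a regression test.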
   
   Then make [a PR to fix the 
error](https://github.com/apache/datafusion/pull/22741), and the test stays in 
the regression suite forever.
   
   So my question is: if we adopt another approach, like 
[SqlFuzz](https://github.com/andygrove/sqlfuzz), what would the equivalent 
workflow look like?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

