adriangb opened a new pull request, #23058:
URL: https://github.com/apache/datafusion/pull/23058

   ## Which issue does this PR close?
   
   - Closes #23056.
   
   ## Rationale for this change
   
   `Unparser::expr_to_sql` overflows the OS stack on deeply nested expressions, 
and — surprisingly — it does so **even with the `recursive_protection` feature 
enabled** (the default for the `datafusion` crate).
   
   While investigating I found the root cause is subtler than "a function is 
missing `#[recursive]`":
   
   `expr_to_sql_inner` **is** already annotated and **is** re-entered on the 
function-argument / dialect-override paths, so the trampoline does fire. The 
real problem is that the `recursive` crate's default red zone is **128 KiB**, 
but a single `expr_to_sql_inner` stack frame in debug builds is **~130 KiB** — 
*larger than the red zone*. `stacker::maybe_grow` only grows when `remaining < 
red_zone`, so a frame can pass the check and then overflow before the next 
checkpoint. The unparser installed **no `StackGuard`**, unlike the planner, 
which already does exactly this in `query.rs`:
   
   ```rust
   // query.rs
   let _guard = StackGuard::new(256 * 1024);
   ```
   
   with a comment noting the same debug-frame-size issue (PR #13310).
   
   Secondary leaks: several internal recursion sites recursed through the 
public, un-annotated `expr_to_sql` (function arguments, `make_array`, 
`array_element`, `named_struct`, `map`, and the `array_has` / `date_part` 
dialect overrides), and `remove_unnecessary_nesting` (pretty mode) was not 
annotated at all — it would be the next overflow site once the first is fixed.
   
   ## What changes are included in this PR?
   
   The recursive design is kept (an iterative rewrite of the tree-building 
unparser would be a large, high-risk change for no benefit over the 
trampoline). The fix mirrors the planner:
   
   - Install `StackGuard::new(256 * 1024)` once at the top-level `expr_to_sql`.
   - Add an annotated `pub(crate) expr_to_sql_with_nesting` recursion core that 
all internal recursion sites call. This keeps pretty-mode behavior identical 
(each level still runs `remove_unnecessary_nesting`) while making every nesting 
level a stack-growth checkpoint.
   - Annotate `remove_unnecessary_nesting` with `#[recursive]`.
   
   The PR is split into two commits for review: the failing regression test 
first, then the fix.
   
   ## Are these changes tested?
   
   Yes. `test_deeply_nested_expr_does_not_overflow_stack` (gated on 
`recursive_protection`) unparses a depth-2000 `array_has` chain on 
`PostgreSqlDialect` and a depth-2000 binary chain on a 2 MiB thread. It aborts 
the process on `main` and passes with the fix. I verified empirically that with 
the default 128 KiB red zone the test still aborts, and that 256 KiB carries 
both paths past depth 5000 on a realistic thread.
   
   Note: the guarantee holds on realistically-sized thread stacks (≥1–2 MiB). 
The 512 KiB-stack figure in the issue's repro is too small to bootstrap the 
heavier `array_has` override path even after the fix, but that is the same 
memory floor the planner side has and is not a configuration used by DataFusion 
in practice.
   
   ## Are there any user-facing changes?
   
   No public API changes. Deeply nested expressions that previously aborted the 
process now unparse successfully when `recursive_protection` is enabled.
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)
   
   https://claude.ai/code/session_01QyHc4z88pC8nQLpFx7z2vM


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to