adriangb opened a new issue, #23056:
URL: https://github.com/apache/datafusion/issues/23056

   ### Describe the bug
   
   `Unparser::expr_to_sql` overflows the stack on deeply-nested expressions, 
and — more importantly — the `recursive_protection` feature does **not** 
prevent it for a large class of expressions, even though it is enabled by 
default in the `datafusion` crate.
   
   `expr_to_sql_inner` is annotated with `#[cfg_attr(feature = 
"recursive_protection", recursive::recursive)]`, so direct `inner -> inner` 
recursion (binary ops, etc.) is protected. However, several recursion paths 
route back through the **public** `Unparser::expr_to_sql`, which is **not** 
annotated:
   
   - `function_args_to_sql` (every scalar function argument) -> 
`self.expr_to_sql(e)`
   - `make_array_to_sql`, `array_element_to_sql`, `map_to_sql`, 
`get_field_to_sql` -> `self.expr_to_sql(...)`
   - dialect scalar-function overrides, e.g. 
`PostgreSqlDialect::array_has_to_sql_any` -> `unparser.expr_to_sql(needle)` / 
`expr_to_sql(haystack)`
   
   Because the `recursive` trampoline only kicks in on the annotated 
`expr_to_sql_inner`, recursing through the un-annotated public `expr_to_sql` 
means the stack-growing protection is effectively bypassed for nested function 
arguments. So a query whose unparse path goes through nested scalar functions 
overflows the real OS stack regardless of `recursive_protection`.
   
   This is related to #19787 (general "reduce recursion level of the unparser") 
but is a distinct, concrete defect: the protection that already exists is 
silently defeated on the function-argument / dialect-override paths.
   
   ### To Reproduce
   
   Self-contained, public-API reproduction (uses `datafusion-functions-nested` 
for `array_has`, but any nested scalar function — e.g. nested `array_element`, 
`make_array`, or even nested `Expr` arithmetic via `function_args_to_sql`-bound 
functions — exhibits the same behavior; a deep chain of plain binary `+` 
overflows at the same rate too):
   
   ```rust
   use datafusion::logical_expr::{col, lit, Expr};
   use datafusion::sql::unparser::{dialect::PostgreSqlDialect, Unparser};
   use datafusion_functions_nested::expr_fn::array_has;
   
   fn main() {
       const DEPTH: usize = 2_000;
   
       // A simple linear chain: array_has(array_has(... array_has(col, 'x') 
...), 'x')
       let mut e: Expr = col("c");
       for _ in 0..DEPTH {
           e = array_has(e, lit("x"));
       }
   
       // Unparse on a small thread stack to make the overflow 
fast/deterministic.
       // (It also overflows on the default stack at a larger DEPTH.)
       let handle = std::thread::Builder::new()
           .stack_size(512 * 1024)
           .spawn(move || {
               let dialect = PostgreSqlDialect {};
               let unparser = Unparser::new(&dialect);
               unparser.expr_to_sql(&e).map(|sql| sql.to_string())
           })
           .unwrap();
       let _ = handle.join(); // aborts the process before returning
   }
   ```
   
   Output:
   
   ```
   thread '<unknown>' has overflowed its stack
   fatal runtime error: stack overflow, aborting
   ```
   
   The same overflow occurs whether or not the `recursive_protection` feature 
is enabled, and with both `PostgreSqlDialect` and `DefaultDialect`. The cost is 
roughly linear in nesting depth (measured ~200 KB of stack per nested level in 
a debug build), so the protection genuinely never engages on this path — it 
does not merely shift the threshold.
   
   ### Expected behavior
   
   With `recursive_protection` enabled (the default for the `datafusion` 
crate), unparsing a deeply-nested but otherwise valid expression should not 
overflow the OS stack — it should either succeed (via the stack-growing 
trampoline) or return a `Result::Err` for a recursion-limit, never abort the 
process.
   
   ### Additional context
   
   - datafusion commit: `0fc55d06797cac2190239c82b2d5f0a06b4d42e7` (workspace 
version `54.0.0`)
   - Affected code:
     - `datafusion/sql/src/unparser/expr.rs`: public `expr_to_sql` (~L96) is 
not `#[recursive]`; `expr_to_sql_inner` (~L105) is. The leak sites are the 
`self.expr_to_sql(...)` calls in `make_array_to_sql` (~L663), 
`array_element_to_sql` (~L690), `map_to_sql` (~L784/L792), and 
`function_args_to_sql` (~L926).
     - `datafusion/sql/src/unparser/dialect.rs`: 
`PostgreSqlDialect::array_has_to_sql_any` (~L409) calls 
`unparser.expr_to_sql(needle)` / `expr_to_sql(haystack)`.
   
   Likely fixes: have these internal recursion sites call the (annotated) 
`expr_to_sql_inner` instead of the public `expr_to_sql`, and/or move the 
`#[recursive]` annotation onto the public `expr_to_sql` (or add an annotated 
private recursion entry point that both use). Note `expr_to_sql` also runs 
`remove_unnecessary_nesting` in pretty mode, which would need the same 
treatment.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to