altmannmarcelo opened a new pull request, #2378:
URL: https://github.com/apache/datafusion-sqlparser-rs/pull/2378
## Summary
`Parser::parse_function_args` runs in `O(2^n)` time on `n`-deep nested
function-call arguments (e.g. `replace(replace(replace(...)))`) for any dialect
where `supports_named_fn_args_with_expr_name()` is true — notably
**PostgreSQL** and **MSSQL**. A 20-level chain takes ~43s; a 30-level chain
would take ~12 hours. The fix makes parse time linear.
## Root cause
To support expression-named arguments like `JSON_OBJECT(a + b : c)`,
`parse_function_args` speculatively parses the *entire* argument as an
expression to see whether a named-argument operator (`=>`, `:=`, `:`) follows:
```rust
let arg = if self.dialect.supports_named_fn_args_with_expr_name() {
self.maybe_parse(|p| {
let name = p.parse_expr()?; // parse the
WHOLE argument
let operator = p.parse_function_named_arg_operator()?; // needs
=>/:=/:
let arg = p.parse_wildcard_expr()?.into();
Ok(FunctionArg::ExprNamed { name, arg, operator })
})?
} else { ... };
if let Some(arg) = arg { return Ok(arg); }
let wildcard_expr = self.parse_wildcard_expr()?; // RE-PARSE the
same argument
```
For a plain positional argument that is itself a function call, the
speculative `parse_expr()` consumes the whole nested subtree, finds no
named-arg operator, `maybe_parse` rewinds, and `parse_wildcard_expr()` parses
the identical subtree again. Each nesting level parses its child twice: `T(n) =
2*T(n-1)` → `O(2^n)`.
Introduced in commit `92be237` — "Add support for MSSQL's
`JSON_ARRAY`/`JSON_OBJECT` expr (#1507)" — which changed named-argument
detection from a single-token `parse_identifier` lookahead to this speculative
full-expression parse.
## Fix
Parse the leading expression once, then check for a named-argument operator
(an O(1) token check). If one follows, it is a named argument; otherwise the
same already-parsed expression is the positional argument. The positional tail
(`* EXCLUDE`, `expr AS alias`) is factored into `parse_unnamed_function_arg`,
shared by both paths. No re-parse; behavior is otherwise unchanged.
## Evidence
Pathological input: `SELECT replace(replace(... x ..., 'a','b'), 'a','b')`,
PostgreSQL dialect. Parse time per nesting depth, before vs after the fix
(milliseconds):
| depth | before | after |
|------:|-------:|------:|
| 10 | 48 ms | 0.025 ms |
| 20 | 42,600 ms | 0.049 ms |
| 30 | ~43,000,000 ms (extrapolated, not run) | 0.071 ms |
Both columns are parse-only (min of 3 runs). Before: parse time doubles per
level — `O(2^n)` (additional measured points: depth 15 = 1,350 ms, depth 18 =
10,600 ms). After: linear, ~0.0024 ms per level. At depth 20 that is roughly an
870,000x speedup. The depth-30 case was **not run** before the fix;
extrapolating the doubling (42,600 ms x 2^10) puts it near 12 hours.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]