fmonjalet opened a new pull request, #16031: URL: https://github.com/apache/datafusion/pull/16031
## Which issue does this PR close? - Closes #16030 ## Rationale for this change Some substrait plans can cause DataFusion to trigger stack overflow, crashing the hosting process and preventing to run them. This stack overflow is caused by very large recursion levels. See issue for more details. This kind of issue may also occur in the SQL --> Logical plan path, but I did not investigate it yet. The "ideal" fix would be not to rely on recursion for most operations on plans, but this sounds like an unreasonably large and complex code change for now. ## What changes are included in this PR? When transforming a substrait function call to DataFusion logical plan, if the substrait function maps to a DataFusion binary operator, but has more than 2 arguments, it is mapped to a tree of BinaryExpr. This BinaryExpr tree is not balanced, and its depth is the number of arguments: ``` Op / \ arg1 Op / \ arg2 ... / \ argN Op ``` Since many functions manipulating the logical plan are recursive, it means that N arguments result in an `O(N)` recursion, leading to stack overflows for large N (1000 for example). Transforming these function calls into a balanced tree mitigates the issue: ``` .__ Op __. / \ Op Op / \ / \ ... ... ... ... / \ / \ / \ / \ arg1 ... argN ``` The recursion depth is now `O(log2(N))`, meaning that 1000 arguments results in a depth of ~10, and it would take 2^1000 arguments to reach a depth of 1000, which is a vastly unreasonable amount of data. Therefore, it's not possible to use this flaw anymore to trigger stack overflows in processes running DataFusion. The new balanced tree preserves the order of execution (`arg1` evaluated before `arg2`) and has the same order of magnitude of operator nodes. The main change is that there is now at least `O(log2(N))` operators evaluated, whereas before the minimum was 1, if the first operator exited after evaluating only its left side. ## Are these changes tested? Yes ## Are there any user-facing changes? No, the change to the logical plan is semantically equivalent and should not be exposed to users. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org