fmonjalet opened a new pull request, #16031:
URL: https://github.com/apache/datafusion/pull/16031

   ## Which issue does this PR close?
   
   - Closes #16030
   
   ## Rationale for this change
   
   Some substrait plans can cause DataFusion to trigger stack overflow, 
crashing the hosting process and preventing to run them. This stack overflow is 
caused by very large recursion levels. See issue for more details.
   
   This kind of issue may also occur in the SQL --> Logical plan path, but I 
did not investigate it yet. The "ideal" fix would be not to rely on recursion 
for most operations on plans, but this sounds like an unreasonably large and 
complex code change for now.
   
   ## What changes are included in this PR?
   
   When transforming a substrait function call to DataFusion logical plan,
   if the substrait function maps to a DataFusion binary operator, but has
   more than 2 arguments, it is mapped to a tree of BinaryExpr. This
   BinaryExpr tree is not balanced, and its depth is the number of
   arguments:
   
   ```
          Op
         /  \
       arg1  Op
            /  \
          arg2  ...
                /  \
              argN  Op
   ```
   
   Since many functions manipulating the logical plan are recursive, it
   means that N arguments result in an `O(N)` recursion, leading to stack
   overflows for large N (1000 for example).
   
   Transforming these function calls into a balanced tree mitigates the
   issue:
   
   ```
                .__ Op __.
               /          \
             Op            Op
            /  \          /  \
          ...  ...      ...  ...
         /  \  /  \    /  \  /  \
       arg1        ...          argN
   ```
   
   The recursion depth is now `O(log2(N))`, meaning that 1000 arguments
   results in a depth of ~10, and it would take 2^1000 arguments to reach a
   depth of 1000, which is a vastly unreasonable amount of data.
   
   Therefore, it's not possible to use this flaw anymore to trigger stack
   overflows in processes running DataFusion.
   
   The new balanced tree preserves the order of execution (`arg1` evaluated
   before `arg2`) and has the same order of magnitude of operator nodes.
   The main change is that there is now at least `O(log2(N))` operators 
evaluated,
   whereas before the minimum was 1, if the first operator exited after 
evaluating
   only its left side.
   
   ## Are these changes tested?
   
   Yes
   
   ## Are there any user-facing changes?
   
   No, the change to the logical plan is semantically equivalent and should not 
be exposed to users.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to