mslapek commented on issue #4628:
URL: 
https://github.com/apache/arrow-datafusion/issues/4628#issuecomment-1470106106

   > I think having the return value indicate if changes have been made as in 
your original example makes sense to me.
   
   At the first sight the idea seems to be compelling...
   
   Nevertheless I'm quite **pessimistic about the idea**. 😵‍💫
   
   The main reason are optimisations, which **UNDO** other optimisations!
   
   *(sorry for the long post! but it's core architecture stuff...)*
   
   <br/>
   
   ## Examples of cancelling optimisations
   
   **Example 1. Commutative optimisations**
   
   Some operations are commutative, like *projection* or *limit* - it is used 
in both `PushDownProjection` and `PushDownLimit` - but in a different 
directions (grep `commute` in these files).
   
   So for some plans they will always give `Some(new_plan)`. 😕
   
   **Example 2. Undoing inside of an optimisation**
   
   Inside `PushDownLimit` we have:
   
   1. `Limit A -> Projection -> Limit B`
   2. Added a limit after the projection: `Limit A -> Projection -> Limit C -> 
Limit B`
   3. Merge of the `Limit C` and `Limit B` in ANOTHER invocation of 
`try_optimize`: `Limit A -> Projection -> Limit B`
   
   We have many invocations of `try_optimize` due to `apply_order` option.
   
   The point is that for a fixed point (merge of `Limit C + Limit B` is `Limit 
B`) `PushDownLimit` will always yield `Some(...)`.
   
   <br/>
   
   ## Theoretical pain
   
   Let's look at the issue formally... 😎
   
   Consider small changes `Δ_1, Δ_2, ... Δ_n` - some signed integers.
   
   The total change is:
   
   ```
   Δ_all = Δ_1 + Δ_2 + ... + Δ_n
   ```
   
   So `Δ_all ≠ 0` &nbsp; **implies** &nbsp; `Δ_i ≠ 0` for some `i`.
   
   The premise of `Option<Plan>` is that the implication holds in **the another 
direction** - which is NOT the case!
   
   Just give `Δ_1 = 1`, `Δ_2 = -1`, `Δ_3 = 0`... (`Δ_2` cancels changes from 
`Δ_1`!)
   
   ## Alternatives?
   
   Let's look again at the implication - how can we use it?
   
   `Δ_all ≠ 0` &nbsp; **implies** &nbsp; `Δ_i ≠ 0` for some `i`.
   
   I would consider to use some stats about the tree - as a kind of heuristics. 
For example: **number of nodes in the tree**.
   
   Each optimization could return `(LogicalPlan, Δ of number of nodes)`. If sum 
of all `Δ of number of nodes` is nonzero - then we know the tree has changed.
   
   I guess there could be other useful stats, like `∑ (node_type * node depth)`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to