[GitHub] [arrow-datafusion] crepererum commented on issue #1972: DataFusion Optimizer framework discussion

GitBox Fri, 25 Mar 2022 02:48:36 -0700


crepererum commented on issue #1972:
URL: 
https://github.com/apache/arrow-datafusion/issues/1972#issuecomment-1078838974



   # Egg
   @pjmore great write up, thank you. I have a few notes on your observations 
regarding egg:
   
   1. It might be worth sharing them with the egg developers, they have a 
[quite active discussion 
forum](https://github.com/egraphs-good/egg/discussions) and seem to be quite 
eager to hear what people build using egg (or where they fail)
   2. I think the explosion of graph sizes can be handled somewhat by tuning 
the cost function, by match guards (like the ones that are also used to to 
prevent mis-optimization of division by zero) or by pruning (their math 
examples uses [this 
trick](https://github.com/egraphs-good/egg/blob/f1238e3a95c7e4e65ce01384f2f88198d26657f9/tests/math.rs#L107-L108))
   3. The `(A OR B) -> (B OR A)` rule you bring up is actually an interesting 
one. When experimenting with egg I found that these symmetry rules are really 
helpful to write a maintainable, short, and easy to understand rules set. 
Imagine the rule `(A OR B) AND (A OR C) -> A OR (B AND C)`. This rule would 
only work properly if you combine it with the symmetry rule or you end up 
enumerating all "permutations" of the simplification rule. Now we might 
conclude that the symmetry rule should only be applied once or not on processed 
nodes, however for nesting it's quite important to be able to re-apply it to 
new nodes as well.
   4. I think the egg `Analysis` trait is extremely powerful, even a bit too 
complex since it can do multiple things like analysis (which then can be used 
by rules to prevent optimizations, see [this 
example](https://github.com/egraphs-good/egg/blob/f1238e3a95c7e4e65ce01384f2f88198d26657f9/tests/math.rs#L144-L153)),
 complex optimization passes (e.g. constant folding) and pruning.
   
   # Multi-objective optimization
   Another note multi-objective optimizations:
   
   From experience as well as mid-level understanding of optimization in 
general, having multiple optimizations targets is always tricky (but depending 
on the use case not unsolvable).
   
   If you really wanna keep the goals separate, then comparing solutions to 
your issue can only be done by strict comparison (e.g. plan A is better than B 
if it has lower cast AND risk). More technically, you stick to the [Pareto 
Front](https://en.wikipedia.org/wiki/Pareto_front).
   
   However this often steals you quite some potential, since massive 
improvements in one metric might result in slight disadvantages in another 
metric (e.g. you might be able shave of 90% cost by increasing the risk by 5%). 
Now most people would intuitively decide to take the risk here. So how do you 
balance the two metrics? What's often done is a linear combination (like `total 
= A * risk + B * cost`, `A` and `B` being tuning parameters). This obviously 
has its own issue (like all metrics must "scale" in the same linear way, tuning 
parameters are tricky to get right).
   
   Also see [Multiple-criteria 
decision-making](https://en.wikipedia.org/wiki/Multiple-criteria_decision_analysis).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-datafusion] crepererum commented on issue #1972: DataFusion Optimizer framework discussion

Reply via email to