alamb commented on issue #1972: URL: https://github.com/apache/arrow-datafusion/issues/1972#issuecomment-1156308944
> My opinion for DataFusion's optimizer framework is that we should continue focus on the heuristic planner approach in current phase, implement an optimizer framework like SparkSQL's catalyst optimizer, make it relatively easy to add new rules. In future, we can go with the adaptive execution approach.

Thank you @mingmwang for the writeup. I would second your assertion that almost all successful real-world (e.g. commercial) query optimizers are not implemented with a Cascades-like framework, but are instead some combination of heuristics and cost models.

I also agree with the point that cost models have unsolved error propagation issues. In my experience, after about 2-3 joins the output cardinality estimate is basically a guess, even with advanced statistics like histograms.

What I would like to see in DataFusion is:
1. A solid "classic" heuristic optimizer as the default
2. Sufficient extension points that anyone who wants to experiment with, create, or use a different optimizer strategy can easily do so.

In my mind this is like `LLVM`: it provides a "state of the art" foundation, and then users can customize as they need.
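
To make the "heuristic default plus extension points" idea concrete, here is a minimal sketch in Rust. It is a toy, not DataFusion's actual API: the `Plan` enum, the `Rule` trait, the `Optimizer` struct, and both example rules are hypothetical names invented for illustration. The built-in rule stands in for the "classic" heuristic default, and `with_rule` stands in for an extension point through which a user can plug in their own rewrite.

```rust
// Toy logical plan. Hypothetical; this is NOT DataFusion's `LogicalPlan`.
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    Scan { table: String },
    Filter { predicate: String, input: Box<Plan> },
    Limit { n: usize, input: Box<Plan> },
}

/// Extension point (hypothetical trait): a rewrite rule inspects one plan
/// node and returns `Some(new_node)` if it rewrote it, `None` otherwise.
trait Rule {
    fn apply(&self, plan: &Plan) -> Option<Plan>;
}

/// Built-in heuristic: Limit(a, Limit(b, x)) becomes Limit(min(a, b), x).
struct MergeLimits;

impl Rule for MergeLimits {
    fn apply(&self, plan: &Plan) -> Option<Plan> {
        if let Plan::Limit { n, input } = plan {
            if let Plan::Limit { n: inner, input: inner_input } = &**input {
                return Some(Plan::Limit {
                    n: (*n).min(*inner),
                    input: inner_input.clone(),
                });
            }
        }
        None
    }
}

/// A user-supplied rule: drop filters whose predicate is the literal "true".
struct RemoveTrueFilter;

impl Rule for RemoveTrueFilter {
    fn apply(&self, plan: &Plan) -> Option<Plan> {
        match plan {
            Plan::Filter { predicate, input } if predicate.as_str() == "true" => {
                Some((**input).clone())
            }
            _ => None,
        }
    }
}

/// Rule-based optimizer: an ordered list of rules applied bottom-up,
/// repeated until the plan stops changing or `max_passes` is reached.
struct Optimizer {
    rules: Vec<Box<dyn Rule>>,
    max_passes: usize,
}

impl Optimizer {
    /// Default optimizer with only the built-in heuristic rule.
    fn new() -> Self {
        Optimizer { rules: vec![Box::new(MergeLimits)], max_passes: 8 }
    }

    /// Register an additional rule after the existing ones.
    fn with_rule(mut self, rule: Box<dyn Rule>) -> Self {
        self.rules.push(rule);
        self
    }

    fn optimize(&self, plan: Plan) -> Plan {
        let mut current = plan;
        for _ in 0..self.max_passes {
            let next = self.rewrite(&current);
            if next == current {
                break; // fixpoint: no rule changed anything in this pass
            }
            current = next;
        }
        current
    }

    /// Rewrite the children first, then try every rule on this node.
    fn rewrite(&self, plan: &Plan) -> Plan {
        let mut node = match plan {
            Plan::Scan { .. } => plan.clone(),
            Plan::Filter { predicate, input } => Plan::Filter {
                predicate: predicate.clone(),
                input: Box::new(self.rewrite(input)),
            },
            Plan::Limit { n, input } => Plan::Limit {
                n: *n,
                input: Box::new(self.rewrite(input)),
            },
        };
        for rule in &self.rules {
            if let Some(rewritten) = rule.apply(&node) {
                node = rewritten;
            }
        }
        node
    }
}
```

With these toy types, `Optimizer::new().optimize(plan)` runs only the default heuristic, while `Optimizer::new().with_rule(Box::new(RemoveTrueFilter)).optimize(plan)` runs the same pipeline with the extra user rule. A real framework would also need a way to replace or reorder the default rules, which this sketch leaves out.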
