[GitHub] [arrow-datafusion] mingmwang commented on issue #1972: DataFusion Optimizer framework discussion

GitBox Mon, 13 Jun 2022 22:58:07 -0700


mingmwang commented on issue #1972:
URL: 
https://github.com/apache/arrow-datafusion/issues/1972#issuecomment-1154749123

Nice discussion!

I see many people raised the points to build the volcano/cascades style
planner for DataFusion, either on Egg or to build from scratch. To implement a
cascades style planner, usually there are three parts:
1: The optimizer framework, including the memo structure which consists of
group of expressions, the search and prune tasks, duplicate detection etc.
2: Rule sets consist of the transformation rules and implementation rules.
3: Cost model, stats estimation.

The whole search and prune algorithm is performed by the specific tasks. It
defines five different types of optimization tasks: Group Optimization; Group
Exploration; Expression Optimization; Input Optimization; Rule Application.

Compared to a traditional heuristic planner, a cascades style optimizer is
much more complex. But the most difficult and challenge work is in the Part 3,
the stats estimation. Even we can collect all the necessary stats like rows
counts, cardinality and histogram for all the columns, make them correct and up
to date, it is still hard to derive the correct stats from bottom leaf
operators(table scan) to the top operators. There are many papers regarding
stats estimation, lot of maths and magic.

And in production, the cascade style optimizer is also hard to triage and
debug, because of the complexity of the search space, the unreliable stats, or
just because there is some bug in the rule.implementation, it is hard to debug
and figure out why the planner choose such a plan.

Because of the unreliable stats, it is hard to come up with a real optimal
plan before execution. Some other engines take another approach and try to
adaptively adjust the execution plan in runtime, SparkSQL's AQE is an example.
Another example is Snowflake's adaptive aggregation operator placement:
https://patents.justia.com/patent/20210089533
It is not that general but the adaptive execution can solve the major
physical plan selection problem.

Today, some open source Query/DB engines had implemented the
volcano/cascades optimizer like Apache Calcite, Greenplum Orca. They take many
years to make them solid and mature. For Apache calcite, most engines just
leverage Calcite as a SQL parser and heuristic planner, Hive try to leverage
Calcite to implement an CBO optimizer, but I don't think it is a success story.

My opinion for DataFusion's optimizer framework is that we should continue
focus on the heuristic planner approach in current phase, implement an
optimizer framework like SparkSQL's catalyst optimizer, make it relatively easy
to add new rules. In future, we can go with the adaptive execution approach.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-datafusion] mingmwang commented on issue #1972: DataFusion Optimizer framework discussion

Reply via email to