[
https://issues.apache.org/jira/browse/ARROW-10716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17238316#comment-17238316
]
Daniël Heres edited comment on ARROW-10716 at 11/24/20, 6:57 PM:
-----------------------------------------------------------------
A summary of some points I found when coming up with a new framework for polars
([https://github.com/ritchie46/polars]) that uses some ideas from the Spark
catalyst optimizer [1]
Some points I liked about the design from Spark Catalyst:
* Recursion on tree should be only be needed once, and not for each
optimization pass. So every rule can be written using simple pattern matching.
This can be captured in some kind of framework.
* A large nr. of optimization rules should run a nr. of times to reach a fixed
point, i.e. running until the logical plan doesn't change anymore. If doing
this, it is important that all of your optimization rules only make the tree
"smaller" in some sense. So either should reduce the nr. of nodes or make the
plan "cheaper".
The optimizer I made for Polars is in very early stages, but I did some design
and first iterations to come up with a first version of a optimization
framework.
* In Rust, if elements in your tree are Boxed, you need to clone part of the
tree when you want to mutate part of the tree. So simple recursing the tree +
mutating it in scala is not possible without changing. You could maybe wrap
everything in something like Arc/RC <RefCell>>, but this has a higher overhead.
You could also generate a whole new tree every iteration. This will however be
quite a bit slower, especially if you would do this per optimization rule which
can grow a lot!
Some points that a first iteration is different than the optimizer in Spark
* It uses an tree backed by an arena to efficiently allocate data for the tree
and mutate it. This means that if you don't generate new nodes, you don't even
allocate, just switch some index to different nodes around. Also a tree in a
arena is very nice for the locality of the data.
The arena brings a bit more unsafety, as you
* Uses manual recursion (with pre allocated stack) instead of the call stack
to recurse (a bit uglier, but if you only write it once can be worth it for
performance).
* In Catalyst, only a single optimization rule runs until reaching a fixed
point, and then moves to . In the Polars version, all rules run in the inner
loop until the whole optimization reaches a fixed point. Benefit is that you
don't have to make sure the order of the rules is important. Also it can bring
_more_ optimizations, as e.g. a rule to evaluate some expressions can have an
effect on a rule to propagate nulls that can have an effect on predicate
pushdown, etc.
* In Catalyst you have to note whether your optimization needs to recurses
topdown or bottom up (for example more useful to constant folding as otherwise
you would need lots of iterations to fold a complex contant expression). In
this optimizer, the optimizer does both itself, by also optimizing a node right
after it changed. This means that the optimizer needs to do perform iterations
in general, and you need to think less about it.
TODO for design in Polars:
* Some optimization rules can be more expensive than others. It might make
sense to keep track of each node individually to check whether it changed
* Different optimization rules need different input, like the schema/type of a
column, etc.
* Some optimizations need to keep track of state, this is not yet handled in
this optimizer.
[1 ]http://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf
was (Author: dandandan):
A summary of some points I found when coming up with a new framework for polars
(https://github.com/ritchie46/polars) that uses some ideas from the Spark
catalyst optimizer [1]
Some points I liked about the design from Spark Catalyst:
* Recursion on tree should be only be needed once, and not for each
optimization pass. So every rule can be written by mostly. This can be captured
in some kind of framework.
* A large nr. of optimization rules should run a nr. of times to reach a fixed
point, i.e. running until the logical plan doesn't change anymore. If doing
this, it is important that all of your optimization rules only make the tree
"smaller" in some sense. So either should reduce the nr. of nodes or make the
plan "cheaper".
The optimizer I made for Polars is in very early stages, but I did some design
and first iterations to come up with a first version of a optimization
framework.
* In Rust, if elements in your tree are Boxed, you need to clone part of the
tree when you want to mutate part of the tree. So simple recursing the tree +
mutating it in scala is not possible without changing. You could maybe wrap
everything in something like Arc/RC <RefCell>>, but this has a higher overhead.
You could also generate a whole new tree every iteration. This will however be
quite a bit slower, especially if you would do this per optimization rule which
can grow a lot!
Some points that a first iteration is different than the optimizer in Spark
* It uses an tree backed by an arena to efficiently allocate data for the tree
and mutate it. This means that if you don't generate new nodes, you don't even
allocate, just switch some index to different nodes around. Also a tree in a
arena is very nice for the locality of the data.
The arena brings a bit more unsafety, as you
* Uses manual recursion (with pre allocated stack) instead of the call stack to
recurse (a bit uglier, but if you only write it once can be worth it for
performance).
* In Catalyst, only a single optimization rule runs until reaching a fixed
point, and then moves to . In the Polars version, all rules run in the inner
loop until the whole optimization reaches a fixed point. Benefit is that you
don't have to make sure the order of the rules is important. Also it can bring
_more_ optimizations, as e.g. a rule to evaluate some expressions can have an
effect on a rule to propagate nulls that can have an effect on predicate
pushdown, etc.
* In Catalyst you have to note whether your optimization needs to recurses
topdown or bottom up (for example more useful to constant folding as otherwise
you would need lots of iterations to fold a complex contant expression). In
this optimizer, the optimizer does both itself, by also optimizing a node right
after it changed. This means that the optimizer needs to do perform iterations
in general, and you need to think less about it.
TODO for design in Polars:
* Some optimization rules can be more expensive than others. It might make
sense to keep track of each node individually to check whether it changed
* Different optimization rules need different input, like the schema/type of a
column, etc.
* Some optimizations need to keep track of state, this is not yet handled in
this optimizer.
[1 ]http://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf
> [Rust] [DataFusion] Address limitations of logical expression rewrite logic
> ---------------------------------------------------------------------------
>
> Key: ARROW-10716
> URL: https://issues.apache.org/jira/browse/ARROW-10716
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Rust - DataFusion
> Reporter: Andy Grove
> Priority: Major
>
> We have a generic approach to rewriting expressions in operators that worked
> ok until we got to the CASE expression, which is more complex, and required
> adding some hacky code to make it work.
> We should come up with a better design.
> See the discussion on [https://github.com/apache/arrow/pull/8746] for more
> context and links to proposed solutions.
>
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)