[ 
https://issues.apache.org/jira/browse/ARROW-10716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17238316#comment-17238316
 ] 

Daniël Heres edited comment on ARROW-10716 at 11/24/20, 6:57 PM:
-----------------------------------------------------------------

A summary of some points I found when coming up with a new framework for polars 
([https://github.com/ritchie46/polars]) that uses some ideas from the Spark 
catalyst optimizer [1]

Some points I liked about the design from Spark Catalyst:
 * Recursion on tree should be only be needed once, and not for each 
optimization pass. So every rule can be written using simple pattern matching. 
This can be captured in some kind of framework.
 * A large nr. of optimization rules should run a nr. of times to reach a fixed 
point, i.e. running until the logical plan doesn't change anymore. If doing 
this, it is important that all of your optimization rules only make the tree 
"smaller" in some sense. So either should reduce the nr. of nodes or make the 
plan "cheaper". 

The optimizer I made for Polars is in very early stages, but I did some design 
and first iterations to come up with a first version of a optimization 
framework.
 * In Rust, if elements in your tree are Boxed, you need to clone part of the 
tree when you want to mutate part of the tree. So simple recursing the tree + 
mutating it in scala is not possible without changing. You could maybe wrap 
everything in something like Arc/RC <RefCell>>, but this has a higher overhead. 
You could also generate a whole new tree every iteration. This will however be 
quite a bit slower, especially if you would do this per optimization rule which 
can grow a lot!

Some points that a first iteration is different than the optimizer in Spark
 * It uses an tree backed by an arena to efficiently allocate data for the tree 
and mutate it. This means that if you don't generate new nodes, you don't even 
allocate, just switch some index to different nodes around. Also a tree in a 
arena is very nice for the locality of the data.

The arena brings a bit more unsafety, as you 
 * Uses manual recursion (with pre allocated stack) instead of the call stack 
to recurse (a bit uglier, but if you only write it once can be worth it for 
performance).

 * In Catalyst, only a single optimization rule runs until reaching a fixed 
point, and then moves to . In the Polars version, all rules run in the inner 
loop until the whole optimization reaches a fixed point. Benefit is that you 
don't have to make sure the order of the rules is important. Also it can bring 
_more_  optimizations, as e.g. a rule to evaluate some expressions can have an 
effect on a rule to propagate nulls that can have an effect on predicate 
pushdown, etc.

 * In Catalyst you have to note whether your optimization needs to recurses 
topdown or bottom up (for example more useful to constant folding as otherwise 
you would need lots of iterations to fold a complex contant expression). In 
this optimizer, the optimizer does both itself, by also optimizing a node right 
after it changed. This means that the optimizer needs to do perform iterations 
in general, and you need to think less about it.

TODO for design in Polars:
 * Some optimization rules can be more expensive than others. It might make 
sense to keep track of each node individually to check whether it changed 
 * Different optimization rules need different input, like the schema/type of a 
column, etc.
 * Some optimizations need to keep track of state, this is not yet handled in 
this optimizer.

 [1 ]http://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf


was (Author: dandandan):
A summary of some points I found when coming up with a new framework for polars 
(https://github.com/ritchie46/polars) that uses some ideas from the Spark 
catalyst optimizer [1]

Some points I liked about the design from Spark Catalyst:

* Recursion on tree should be only be needed once, and not for each 
optimization pass. So every rule can be written by mostly. This can be captured 
in some kind of framework.
* A large nr. of optimization rules should run a nr. of times to reach a fixed 
point, i.e. running until the logical plan doesn't change anymore. If doing 
this, it is important that all of your optimization rules only make the tree 
"smaller" in some sense. So either should reduce the nr. of nodes or make the 
plan "cheaper". 

The optimizer I made for Polars is in very early stages, but I did some design 
and first iterations to come up with a first version of a optimization 
framework.

* In Rust, if elements in your tree are Boxed, you need to clone part of the 
tree when you want to mutate part of the tree. So simple recursing the tree + 
mutating it in scala is not possible without changing. You could maybe wrap 
everything in something like Arc/RC <RefCell>>, but this has a higher overhead. 
You could also generate a whole new tree every iteration. This will however be 
quite a bit slower, especially if you would do this per optimization rule which 
can grow a lot!

Some points that a first iteration is different than the optimizer in Spark

* It uses an tree backed by an arena to efficiently allocate data for the tree 
and mutate it. This means that if you don't generate new nodes, you don't even 
allocate, just switch some index to different nodes around. Also a tree in a 
arena is very nice for the locality of the data.

The arena brings a bit more unsafety, as you 

* Uses manual recursion (with pre allocated stack) instead of the call stack to 
recurse (a bit uglier, but if you only write it once can be worth it for 
performance).

* In Catalyst, only a single optimization rule runs until reaching a fixed 
point, and then moves to . In the Polars version, all rules run in the inner 
loop until the whole optimization reaches a fixed point. Benefit is that you 
don't have to make sure the order of the rules is important. Also it can bring 
_more_  optimizations, as e.g. a rule to evaluate some expressions can have an 
effect on a rule to propagate nulls that can have an effect on predicate 
pushdown, etc.

* In Catalyst you have to note whether your optimization needs to recurses 
topdown or bottom up (for example more useful to constant folding as otherwise 
you would need lots of iterations to fold a complex contant expression). In 
this optimizer, the optimizer does both itself, by also optimizing a node right 
after it changed. This means that the optimizer needs to do perform iterations 
in general, and you need to think less about it.


TODO for design in Polars:

* Some optimization rules can be more expensive than others. It might make 
sense to keep track of each node individually to check whether it changed 
* Different optimization rules need different input, like the schema/type of a 
column, etc.
* Some optimizations need to keep track of state, this is not yet handled in 
this optimizer.


 [1 ]http://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf

> [Rust] [DataFusion] Address limitations of logical expression rewrite logic
> ---------------------------------------------------------------------------
>
>                 Key: ARROW-10716
>                 URL: https://issues.apache.org/jira/browse/ARROW-10716
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Rust - DataFusion
>            Reporter: Andy Grove
>            Priority: Major
>
> We have a generic approach to rewriting expressions in operators that worked 
> ok until we got  to the CASE expression, which is more complex, and required 
> adding some hacky code to make it work.
> We should come up with a better design.
> See the discussion on [https://github.com/apache/arrow/pull/8746] for more 
> context and links to proposed solutions.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to