pjmore commented on issue #440:
URL:
https://github.com/apache/arrow-datafusion/issues/440#issuecomment-962503478
Some good and bad news on this front. The Tokomak optimization pass combined
with a predicate pushdown pass and a filter<-cross join to filter<-inner join
pass is able to handle TPCH Q19 with each iteration taking ~66ms on my laptop.
It performs the AstSize cost function transforms the expression into the form
that @alamb mentioned in #217. Which gives the optimized logical plan:
```
Projection: #SUM(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount)
Aggregate: groupBy=[[]], aggr=[[SUM(#lineitem.l_extendedprice * Int64(1) -
#lineitem.l_discount)]]
Projection: #lineitem.l_extendedprice, #lineitem.l_discount
Filter: Boolean(true)
Filter: #part.p_container IN ([Utf8("MED BAG"), Utf8("MED BOX"),
Utf8("MED PKG"), Utf8("MED PACK")]) AND #part.p_size BETWEEN Int64(1) AND
Int64(10) AND #part.p_brand = Utf8("Brand#23") AND #lineitem.l_quantity BETWEEN
Int64(10) AND Int64(20) OR #part.p_container IN ([Utf8("LG CASE"), Utf8("LG
BOX"), Utf8("LG PACK"), Utf8("LG PKG")]) AND #part.p_brand = Utf8("Brand#34")
AND #part.p_size BETWEEN Int64(1) AND Int64(15) AND #lineitem.l_quantity
BETWEEN Int64(20) AND Int64(30) OR #lineitem.l_quantity BETWEEN Int64(1) AND
Int64(11) AND #part.p_container IN ([Utf8("SM CASE"), Utf8("SM BOX"), Utf8("SM
PACK"), Utf8("SM PKG")]) AND #part.p_brand = Utf8("Brand#12") AND #part.p_size
BETWEEN Int64(1) AND Int64(5)
Join: #lineitem.l_partkey = #part.p_partkey
Filter: #lineitem.l_shipmode IN ([Utf8("AIR"), Utf8("AIR REG")])
AND #lineitem.l_shipinstruct = Utf8("DELIVER IN PERSON")
TableScan: lineitem projection=Some([1, 4, 5, 6, 13, 14])
TableScan: part projection=Some([0, 3, 5, 6])
```
Bad news is that I had to use a dev branch of egg due to Send requirements
so merging the Tokomak optimizer may have to wait until the next version of egg
releases. On the bright side there is plenty of work that needs to be done
before I feel the optimizer is ready anyways:
* Allow users to create their own rules that will run alongside the existing
rules.
* Allow users to create custom cost functions.
* Allow users to run optimizer with a custom rewrite scheduler.
* Extend the optimizer to allow it to operate on plans as well as
expressions.
* Add the Case expr, this particular expression does not mesh well with
current rewrite syntax.
* Create DSL which is better suited for rewriting expressions which involve
lists and expressions that have to be of a particular type. The current rule
syntax falls short when rewriting to or from lists, there is no real way to
implement the rule
``` col1 = 'one' or col1 = 'two' or col2='three' => col1
InList('one','two','three')```
An example of the need for typed expressions would be when rewriting a
filter followed by a cross join into a inner join, as the optimization below is
only valid when the :
```
Filter: #lineitem.l_partkey = #part.p_partkey
CrossJoin:
TableScan: part
TableScan: lineitem
->
Join: #lineitem.l_partkey = #part.p_partkey
TableScan: part
TableScan: lineitem
```
as the in this case both the left and the right of the filter predicate must
be columns for this to be a valid transform.
Another capability that would be nice to have would be to bind to operators
in the same way as expression so instead of the rules for rotating the
expression tree:
```
rw!("rotate-add"; "(+ ?x (+ ?y ?z))"=>"(+ (+ ?x ?y) ?z)"),
rw!("rotate-mul"; "(* ?x (* ?y ?z))"=>"(* (* ?x ?y) ?z)"),
rw!("rotate-and"; "(and ?a (and ?b ?c))"=>"(and (and ?a ?b) ?c)"),
rw!("rotate-or"; "(or ?a (or ?b ?c))"=>"(or (or ?a ?b) ?c)"),
```
Could instead write something along the lines of:
```
rw!("rotate-expr-homogenous"; "(?op:(+|*|and|or) ?x (?op ?y ?z))" =>"(?op
(?op ?x ?y) ?z)")
```
* Determine whether combining expression and plan optimization at the same
time is viable. Would probably require custom scheduler. Would complicate
writing cost function but it might allow better optimizations to be performed.
I'm going to keep working on this in a separate repo for now because it's a
long way from being what I would consider being complete.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]