pjmore commented on issue #440:
URL: 
https://github.com/apache/arrow-datafusion/issues/440#issuecomment-962503478


   Some good and bad news on this front. The Tokomak optimization pass combined 
with a predicate pushdown pass and a filter<-cross join to filter<-inner join 
pass is able to handle TPCH Q19 with each iteration taking ~66ms on my laptop. 
It performs the AstSize cost function transforms the expression into the form 
that @alamb mentioned in #217. Which gives the optimized logical plan:
   ```
   Projection: #SUM(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount)
     Aggregate: groupBy=[[]], aggr=[[SUM(#lineitem.l_extendedprice * Int64(1) - 
#lineitem.l_discount)]]
       Projection: #lineitem.l_extendedprice, #lineitem.l_discount
         Filter: Boolean(true)
           Filter: #part.p_container IN ([Utf8("MED BAG"), Utf8("MED BOX"), 
Utf8("MED PKG"), Utf8("MED PACK")]) AND #part.p_size BETWEEN Int64(1) AND 
Int64(10) AND #part.p_brand = Utf8("Brand#23") AND #lineitem.l_quantity BETWEEN 
Int64(10) AND Int64(20) OR #part.p_container IN ([Utf8("LG CASE"), Utf8("LG 
BOX"), Utf8("LG PACK"), Utf8("LG PKG")]) AND #part.p_brand = Utf8("Brand#34") 
AND #part.p_size BETWEEN Int64(1) AND Int64(15) AND #lineitem.l_quantity 
BETWEEN Int64(20) AND Int64(30) OR #lineitem.l_quantity BETWEEN Int64(1) AND 
Int64(11) AND #part.p_container IN ([Utf8("SM CASE"), Utf8("SM BOX"), Utf8("SM 
PACK"), Utf8("SM PKG")]) AND #part.p_brand = Utf8("Brand#12") AND #part.p_size 
BETWEEN Int64(1) AND Int64(5)
             Join: #lineitem.l_partkey = #part.p_partkey
               Filter: #lineitem.l_shipmode IN ([Utf8("AIR"), Utf8("AIR REG")]) 
AND #lineitem.l_shipinstruct = Utf8("DELIVER IN PERSON")
                 TableScan: lineitem projection=Some([1, 4, 5, 6, 13, 14])
               TableScan: part projection=Some([0, 3, 5, 6])
   
   ```
   
   Bad news is that I had to use a dev branch of egg due to Send requirements 
so merging the Tokomak optimizer may have to wait until the next version of egg 
releases. On the bright side there is plenty of work  that needs to be done 
before I feel the optimizer is ready anyways:
   * Allow users to create their own rules that will run alongside the existing 
rules.
   * Allow users to create custom cost functions.
   * Allow users to run optimizer with a custom rewrite scheduler.
   * Extend the optimizer to allow it to operate on plans as well as 
expressions.
   * Add the Case expr, this particular expression does not mesh well with 
current rewrite syntax.
   * Create DSL which is better suited for rewriting expressions which involve 
lists and expressions that have to be of a particular type. The current rule 
syntax falls short when rewriting to or from lists, there is no real way to 
implement the rule
   ``` col1 = 'one' or col1 = 'two' or col2='three' => col1 
InList('one','two','three')```
   An example of the need for typed expressions would be when rewriting a 
filter followed by a cross join into a inner join, as the optimization below is 
only valid when the :
   ```
   Filter: #lineitem.l_partkey = #part.p_partkey
       CrossJoin: 
           TableScan: part
           TableScan: lineitem
   -> 
   Join:  #lineitem.l_partkey = #part.p_partkey
       TableScan: part
       TableScan: lineitem
   ```
   as the in this case both the left and the right of the filter predicate must 
be columns for this to be a valid transform.
   Another capability that would be nice to have would be to bind to operators 
in the same way as expression so instead of the rules for rotating the 
expression tree:
   ```
   rw!("rotate-add"; "(+ ?x (+ ?y ?z))"=>"(+ (+ ?x ?y) ?z)"),
   rw!("rotate-mul"; "(* ?x (* ?y ?z))"=>"(* (* ?x ?y) ?z)"),            
   rw!("rotate-and"; "(and ?a (and ?b ?c))"=>"(and (and ?a ?b) ?c)"),
   rw!("rotate-or"; "(or ?a (or ?b ?c))"=>"(or (or ?a ?b) ?c)"),
   ```
   Could instead write something along the lines of:
   ```
   rw!("rotate-expr-homogenous"; "(?op:(+|*|and|or) ?x (?op ?y ?z))" =>"(?op 
(?op ?x ?y) ?z)")
   ```
   
   * Determine whether combining expression and plan optimization at the same 
time is viable. Would probably require custom scheduler. Would complicate 
writing cost function but it might allow better optimizations to be performed. 
   
   I'm going to keep working on this in a separate repo for now because it's a 
long way from being what I would consider being complete.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to