Re: [I] Keep track of common sub-expression across logical plan nodes [arrow-datafusion]

via GitHub Wed, 20 Mar 2024 04:58:00 -0700


Lordworms commented on issue #9576:
URL: 
https://github.com/apache/arrow-datafusion/issues/9576#issuecomment-2009267298


   > > Hey @mustafasrepo, the reason that the first plan added a new Projection 
is that in the Rewriter it would mark the c3+c4 twice so that it judges the 
expressions needed to add an extra Projection layer. However, here I got two 
Problem and wish you could give me an answer. Currently, it seems like I have 
two ways to implement this feature
   > > 
   > > 1. I can directly go through all the expressions of a single plan and if 
I find a BinaryOp then I just add another Projection upon the current plan with 
sub query (easy solution)
   > > 2. The second one is that we should track a DAG over the expression and 
if it is referenced in another plan, we add an extra projection. (which I have 
no idea how to properly trace them in treenode recursion), I don't want to do 
another recursion.
   > >    Which one do you think is better?
   > 
   > The current approach is we use projection to calculate a complex 
expression if it is used at least twice (Otherwise projection deemed 
unnecessary). Hence, first option wouldn't work in this case. The other 
approach may work, however, it may place the projection in a sub-optimal spot. 
As an example, consider following plan,
   > 
   > ```
   > Projection(a+b)
   > --Filter (a+b=0),
   > ----Sort(a+b ASC),
   > ------TableScan(a,b)
   > ```
   > 
   > with second approach you might produce plan below (still better than 
current behaviour. However, sub-optimal)
   > 
   > ```
   > Projection(`a+b`)
   > --Filter (`a+b`=0),
   > ----Projection (a+b as `a+b`)
   > ------Sort(a+b ASC),
   > --------TableScan(a,b)
   > ```
   > 
   > where I used `a+b` to distinguish it from binary expression `a+b`. 
However, instead we could have generated following plan
   > 
   > ```
   > Projection(`a+b`)
   > --Filter (`a+b`=0),
   > ----Sort(`a+b` ASC),
   > ------Projection (a+b as `a+b`)
   > --------TableScan(a,b)
   > ```
   > 
   > Hence, I think best approach is to traverse plan from top to bottom and 
keeping the cumulative complex expression counts in the plan. For plan below
   > 
   > ```
   > Projection(a+b)
   > --Filter (a+b=0),
   > ----Sort(a+b ASC),
   > ------TableScan(a,b)
   > ```
   > 
   > This would produce
   > 
   > ```
   > Projection(a+b), ("a+b", count: 1)
   > --Filter (a+b=0), ("a+b", count: 2)
   > ----Sort(a+b ASC),  ("a+b", count: 2)
   > ------TableScan(a,b)
   > ```
   > 
   > Then after constructing, above tree. With a bottom-up traversal we can 
generate following plan
   > 
   > ```
   > Projection(`a+b`)
   > --Filter (`a+b`=0),
   > ----Sort(`a+b` ASC),
   > ------Projection (a+b as `a+b`)
   > --------TableScan(a,b)
   > ```
   > 
   > by implacing projections to calculate common expression that are used more 
than once by subsequent stages. However, I presume this would involve a lot of 
work.
   
   I got it, Thanks for your solutions. I plan to implement this today.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [I] Keep track of common sub-expression across logical plan nodes [arrow-datafusion]

Reply via email to