[GitHub] [arrow-datafusion] mingmwang commented on pull request #5290: TopDown EnforceSorting implementation

via GitHub Wed, 15 Feb 2023 20:01:46 -0800


mingmwang commented on PR #5290:
URL: 
https://github.com/apache/arrow-datafusion/pull/5290#issuecomment-1432479006


   > The PR is very large in scope. It changes parts of the old code (and 
certainly makes some changes to its tests), and also adds new code (and new 
tests). It would be much easier to review this if it were broken down to two 
PRs, where the first one only replicates the current functionality, has no 
functionality regressions, and does not change any tests at all; with the 
second PR adding new functionality. Right now, the new rule is significantly 
longer than the old rule (which is bad), but it offers more functionality 
(which is great). So is switching from bottom-up to top-down a good change or a 
bad change? We can't tell easily.
   > 
   > Now, let me share my (very) preliminary impression so far after a cursory 
look: I see that it has better handling of sort preserving merges, smarter 
push-down of sorts under unions, and adds support for sort merge joins. These 
are the good bits. The cons are that it seems to lose partition awareness 
(though I'm not sure about this yet) and it seems to regress on some cases 
where it was doing better before. I think at least some of these are due to the 
presumption that there is a global output ordering to preserve, and I am not 
sure I agree with that.
   > 
   > Anyway, we will disentangle and review in detail, but I want to give you a 
heads up that this will take some time. We will need to analyze every case 
carefully, go back to the old version of the code (and tests), compare and 
contrast etc. Before we form an idea on the merits of bottom-up vs. top-down, 
our goal will be to create _two functionally equal_ implementations passing 
_exactly the same_ test suite. Without that, it is not possible to objectively 
decide.
   > 
   > Whatever the result on bottom-up vs. top-down is, I think this exercise 
will end up making the rule better, so that's great 🚀 I will keep you posted as 
we make progress in the upcoming days.
   
   Sure, please take your time and I will add more comments to the code to 
explain the rule process.  
   The new rule looks significantly longer than the original one is because of 
handling the propagating of sort requirements down.  But I think the sort 
removing/adding procedure is very clear and the property/requirement driven 
framework is more powerful.  It can effectively handling below case and figure 
out an optimal plan without adding Sort and then removing Sort back and forth.
   
   ```
   Required('a', 'b', 'c')
      Required('a', 'b')
         Required('a')
   ```
   
   We can leverage the same framework to do more advanced optimizations like 
re-ordering(PostgreSQL has this optimization) the multiple window execs in the 
plan tree and further reduce the number Sort. Generally I think the top-down 
based approach is more easy and straightforward to collect and propagate 
necessary properties and find the global optimal plan.
   
   ```
   Required/Order by ('a', 'b', 'c')
      WindowExec1 Required('a', 'b', 'c')
         WindowExec2 Required('x', 'y', 'z')
             Order by ('x', 'y', 'z')
   ```
   
   
   Some UT results are changed. Yes, I think the major arguing point is whether 
we should preserve output ordering during optimization process or we can trim 
the unnecessary sort columns.  
   As I know, `SparkSQL` preserves the output ordering(SparkSQL does not do 
very sophisticated sort optimizations), `PostgreSQL` sometimes preserves the 
output ordering but sometime not(I guess this is decided by the top/parent 
operators, if they are ordering sensitive, but I'm not sure).
   For DataFusion, my preference is since we alway define the 
`maintains_input_order()` method for physical plan nodes, if it is true, we 
should preserve output ordering and should not trim or reverse output ordering, 
otherwise `maintains_input_order()` is meaningless and very confusing.
   
   There are some other UT result changes, I am not sure whether they are due 
to the original bug or the new rule introduced regression, need to double 
confirm with you and check carefully, especially this one 
`test_window_agg_complex_plan`.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-datafusion] mingmwang commented on pull request #5290: TopDown EnforceSorting implementation

Reply via email to