mingmwang commented on PR #5290:
URL:
https://github.com/apache/arrow-datafusion/pull/5290#issuecomment-1432479006
> The PR is very large in scope. It changes parts of the old code (and
certainly makes some changes to its tests), and also adds new code (and new
tests). It would be much easier to review this if it were broken down to two
PRs, where the first one only replicates the current functionality, has no
functionality regressions, and does not change any tests at all; with the
second PR adding new functionality. Right now, the new rule is significantly
longer than the old rule (which is bad), but it offers more functionality
(which is great). So is switching from bottom-up to top-down a good change or a
bad change? We can't tell easily.
>
> Now, let me share my (very) preliminary impression so far after a cursory
look: I see that it has better handling of sort preserving merges, smarter
push-down of sorts under unions, and adds support for sort merge joins. These
are the good bits. The cons are that it seems to lose partition awareness
(though I'm not sure about this yet) and it seems to regress on some cases
where it was doing better before. I think at least some of these are due to the
presumption that there is a global output ordering to preserve, and I am not
sure I agree with that.
>
> Anyway, we will disentangle and review in detail, but I want to give you a
heads up that this will take some time. We will need to analyze every case
carefully, go back to the old version of the code (and tests), compare and
contrast etc. Before we form an idea on the merits of bottom-up vs. top-down,
our goal will be to create _two functionally equal_ implementations passing
_exactly the same_ test suite. Without that, it is not possible to objectively
decide.
>
> Whatever the result on bottom-up vs. top-down is, I think this exercise
will end up making the rule better, so that's great 🚀 I will keep you posted as
we make progress in the upcoming days.
Sure, please take your time and I will add more comments to the code to
explain the rule process.
The new rule looks significantly longer than the original one is because of
handling the propagating of sort requirements down. But I think the sort
removing/adding procedure is very clear and the property/requirement driven
framework is more powerful. It can effectively handling below case and figure
out an optimal plan without adding Sort and then removing Sort back and forth.
```
Required('a', 'b', 'c')
Required('a', 'b')
Required('a')
```
We can leverage the same framework to do more advanced optimizations like
re-ordering(PostgreSQL has this optimization) the multiple window execs in the
plan tree and further reduce the number Sort. Generally I think the top-down
based approach is more easy and straightforward to collect and propagate
necessary properties and find the global optimal plan.
```
Required/Order by ('a', 'b', 'c')
WindowExec1 Required('a', 'b', 'c')
WindowExec2 Required('x', 'y', 'z')
Order by ('x', 'y', 'z')
```
Some UT results are changed. Yes, I think the major arguing point is whether
we should preserve output ordering during optimization process or we can trim
the unnecessary sort columns.
As I know, `SparkSQL` preserves the output ordering(SparkSQL does not do
very sophisticated sort optimizations), `PostgreSQL` sometimes preserves the
output ordering but sometime not(I guess this is decided by the top/parent
operators, if they are ordering sensitive, but I'm not sure).
For DataFusion, my preference is since we alway define the
`maintains_input_order()` method for physical plan nodes, if it is true, we
should preserve output ordering and should not trim or reverse output ordering,
otherwise `maintains_input_order()` is meaningless and very confusing.
There are some other UT result changes, I am not sure whether they are due
to the original bug or the new rule introduced regression, need to double
confirm with you and check carefully, especially this one
`test_window_agg_complex_plan`.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]