+1. The analyzer rule order issue has bitten me multiple times and it's very hard to make your analyzer rule bug-free if it interacts with other rules.
On Wed, Aug 21, 2024 at 2:49 AM Reynold Xin <r...@databricks.com.invalid> wrote:

> +1 on this too.
>
> When I implemented "group by all", I introduced at least two subtle bugs
> that many reviewers weren't able to catch, and those two bugs would not
> have been possible to introduce if we had a single-pass analyzer. Single
> pass can make the whole framework more robust.
>
> On Tue, Aug 20, 2024 at 7:23 PM Xiao Li <gatorsm...@gmail.com> wrote:
>
>> This sounds like a good idea!
>>
>> The Analyzer is complex, and the changes in the new Analyzer should not
>> affect the existing one. Users could add QO rules and rely on the
>> existing structures and patterns of the logical plan trees generated by
>> the current one. The new Analyzer needs to generate the same logical
>> plan trees as the current one.
>>
>> Cheers,
>>
>> Xiao
>>
>> On Wed, Aug 14, 2024 at 11:27 Vladimir Golubev <vvdr....@gmail.com> wrote:
>>
>>> - I think we can rely on the current tests. One possibility would be
>>> to dual-run both Analyzer implementations if `Utils.isTesting` and
>>> compare the (normalized) logical plans.
>>> - We can implement the analyzer functionality by milestones (Milestone
>>> 0: Project, Filter, UnresolvedInlineTable; Milestone 2: support main
>>> datasources; ...). Running both analyzers in mixed mode may lead to
>>> unexpected logical plan problems, because that would introduce a
>>> completely different chain of transformations.
>>>
>>> On Wed, Aug 14, 2024 at 3:58 PM Herman van Hovell <her...@databricks.com> wrote:
>>>
>>>> +1(000) on this!
>>>>
>>>> This should massively reduce allocations done in the analyzer, and it
>>>> is much more efficient. I also can't count the times that I had to
>>>> increase the number of iterations. This sounds like a no-brainer to me.
>>>>
>>>> I do have two questions:
>>>>
>>>> - How do we ensure that we don't accidentally break the analyzer?
>>>> Are existing tests enough?
>>>> - How can we introduce this incrementally?
>>>> Can we run the analyzer in mixed mode (both single-pass rules and
>>>> the existing tree-traversal rules) for a while?
>>>>
>>>> Cheers,
>>>> Herman
>>>>
>>>> On Fri, Aug 9, 2024 at 10:48 AM Vladimir Golubev <vvdr....@gmail.com> wrote:
>>>>
>>>>> Hello All,
>>>>>
>>>>> I recently did some research in the Catalyst Analyzer area to check
>>>>> whether it's possible to make it single-pass instead of fixed-point.
>>>>> Despite the flexibility of the current fixed-point approach (new
>>>>> functionality = new rule), it has some drawbacks. The dependencies
>>>>> between the rules are unobvious, so it's hard to introduce changes
>>>>> without full knowledge of the rule set. By modifying one rule, the
>>>>> whole chain of transformations can change in an unobvious way. Since
>>>>> we can hit the maximum number of iterations, there's no guarantee
>>>>> that the plan is going to be resolved. And from a performance
>>>>> perspective, the Analyzer can be quite slow for large logical plans
>>>>> and wide tables.
>>>>>
>>>>> The idea is to resolve the tree in a single post-order traversal.
>>>>> This approach can be beneficial for the following reasons:
>>>>>
>>>>> - Code can rely on child nodes being resolved.
>>>>> - Name visibility can be controlled by a scope, which would be
>>>>> maintained during the traversal.
>>>>> - Random in-tree lookups are disallowed during the traversal, so the
>>>>> resolution algorithm would be linear with respect to the parsed
>>>>> tree size.
>>>>> - The Analyzer would deterministically produce a resolved logical
>>>>> plan or a descriptive error.
>>>>>
>>>>> The prototype shows that it's possible to do a bottom-up Analyzer
>>>>> implementation and reuse a large chunk of node-processing code from
>>>>> rule bodies.
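The single-pass, scope-based resolution described above can be sketched in miniature. Everything below is an illustrative toy (the node classes and function names are stand-ins, not Catalyst's actual API): children are resolved first, a name scope flows up the tree, and an unresolvable name fails immediately with a descriptive error instead of surviving N fixed-point iterations.

```python
# Toy sketch of single-pass, bottom-up (post-order) name resolution.
# All classes here are illustrative stand-ins, not Catalyst classes.
from dataclasses import dataclass

@dataclass
class UnresolvedAttr:
    name: str

@dataclass
class Attr:          # resolved attribute with a stable id
    name: str
    attr_id: int

@dataclass
class Relation:      # leaf node producing columns
    columns: list

@dataclass
class Project:
    exprs: list
    child: object

@dataclass
class Filter:
    condition: object
    child: object

def resolve(node):
    """Post-order traversal: resolve children first, then the node itself,
    using only the name scope handed up from the child (no random in-tree
    lookups), so the whole pass is linear in the tree size."""
    if isinstance(node, Relation):
        # Leaf: its output columns become the scope for the parent.
        return node, {a.name: a for a in node.columns}
    if isinstance(node, Filter):
        child, scope = resolve(node.child)
        cond = resolve_expr(node.condition, scope)
        return Filter(cond, child), scope  # Filter keeps the child's scope
    if isinstance(node, Project):
        child, scope = resolve(node.child)
        exprs = [resolve_expr(e, scope) for e in node.exprs]
        # Project replaces the scope with its own output list.
        return Project(exprs, child), {a.name: a for a in exprs}
    raise NotImplementedError(type(node))

def resolve_expr(expr, scope):
    if isinstance(expr, UnresolvedAttr):
        if expr.name not in scope:
            # Deterministic, descriptive error instead of an
            # unresolved plan after hitting the iteration limit.
            raise ValueError(f"UNRESOLVED_COLUMN: `{expr.name}`")
        return scope[expr.name]
    return expr

t = Relation([Attr("a", 1), Attr("b", 2)])
plan = Project([UnresolvedAttr("a")], Filter(UnresolvedAttr("b"), t))
resolved, out_scope = resolve(plan)
```

Because each node is visited exactly once and sees an already-resolved child plus its scope, rule-ordering interactions of the kind mentioned at the top of the thread cannot arise in this traversal.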
>>>>> I did the performance benchmark on two cases where the current
>>>>> Analyzer struggles the most (wide nested views and large logical
>>>>> plans) and got 7-10x performance speedups.
>>>>>
>>>>> Is this something that would be interesting for the Spark community?
>>>>> What would be the best way to proceed with this idea?
>>>>>
>>>>> Regards,
>>>>>
>>>>> Vladimir Golubev.
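The dual-run testing idea suggested earlier in the thread (run both analyzer implementations and compare the normalized logical plans) can be sketched as follows. The plan encoding, the normalization rule, and all names are illustrative assumptions, not Spark internals; the point is only that incidental differences such as auto-generated attribute ids must be normalized away before comparing.

```python
# Illustrative sketch of the dual-run idea: analyze the same parsed plan
# with both implementations and compare after normalization. Plans are
# modeled as nested tuples; normalize() renumbers attribute ids in
# first-seen order so structurally equal plans compare equal even when
# their auto-generated ids differ.

def normalize(plan, mapping=None):
    if mapping is None:
        mapping = {}
    if isinstance(plan, tuple) and plan and plan[0] == "attr":
        _, name, attr_id = plan
        # Replace the global id with a plan-local, deterministic one.
        norm_id = mapping.setdefault(attr_id, len(mapping))
        return ("attr", name, norm_id)
    if isinstance(plan, tuple):
        return tuple(normalize(p, mapping) for p in plan)
    return plan

def dual_run(parsed_plan, fixed_point_analyzer, single_pass_analyzer):
    """Run both analyzers (e.g. only when Utils.isTesting in Spark's case)
    and fail loudly if the normalized results diverge."""
    expected = fixed_point_analyzer(parsed_plan)
    actual = single_pass_analyzer(parsed_plan)
    if normalize(expected) != normalize(actual):
        raise AssertionError("single-pass plan differs from fixed-point plan")
    return expected

# Two analyzers producing structurally equal plans with different ids:
fixed_point = lambda p: ("project", ("attr", "a", 7), ("relation", "t"))
single_pass = lambda p: ("project", ("attr", "a", 42), ("relation", "t"))
result = dual_run(("parsed",), fixed_point, single_pass)
```

A real normalization would have to cover more than ids (aliases, rule-introduced intermediate projections, and so on), which is exactly why the thread notes that the new Analyzer should produce the same logical plan trees as the current one.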