- I think we can rely on the current tests. One possibility would be to
  dual-run both Analyzer implementations if `Utils.isTesting` and compare the
  (normalized) logical plans; see the sketch after this list.
- We can implement the analyzer functionality by milestones (Milestone 0:
  Project, Filter, UnresolvedInlineTable; Milestone 1: support the main data
  sources; ...). Running both analyzers in mixed mode may lead to unexpected
  logical plan problems, because that would introduce a completely different
  chain of transformations.
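To make the dual-run idea concrete, here is a minimal sketch of the
testing-mode cross-check. `resolveSinglePass`, `resolveFixedPoint` and
`normalizePlan` are hypothetical placeholders, not existing Spark APIs; only
`Utils.isTesting` and `LogicalPlan` are real:

```scala
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.util.Utils

object DualRunAnalyzer {
  def execute(plan: LogicalPlan): LogicalPlan = {
    val singlePassResult = resolveSinglePass(plan)
    if (Utils.isTesting) {
      // Cross-check against the legacy fixed-point analyzer. Normalization
      // would canonicalize incidental differences (e.g. generated expression
      // IDs) before comparison.
      val fixedPointResult = resolveFixedPoint(plan)
      assert(
        normalizePlan(singlePassResult) == normalizePlan(fixedPointResult),
        s"Analyzer implementations diverged:\n$singlePassResult\n$fixedPointResult")
    }
    singlePassResult
  }

  // Placeholders standing in for the two analyzer implementations and the
  // plan-normalization step discussed above.
  private def resolveSinglePass(plan: LogicalPlan): LogicalPlan = ???
  private def resolveFixedPoint(plan: LogicalPlan): LogicalPlan = ???
  private def normalizePlan(plan: LogicalPlan): LogicalPlan = ???
}
```

Normalization matters here because the two implementations may legitimately
differ in incidental details such as expression IDs, and the comparison should
only fail on semantic divergence.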
On Wed, Aug 14, 2024 at 3:58 PM Herman van Hovell <her...@databricks.com> wrote:

> +1(000) on this!
>
> This should massively reduce the allocations done in the analyzer, and it
> is much more efficient. I also can't count the number of times I have had
> to increase the number of iterations. This sounds like a no-brainer to me.
>
> I do have two questions:
>
> - How do we ensure that we don't accidentally break the analyzer? Are the
>   existing tests enough?
> - How can we introduce this incrementally? Can we run the analyzer in
>   mixed mode (both single-pass rules and the existing tree-traversal
>   rules) for a while?
>
> Cheers,
> Herman
>
> On Fri, Aug 9, 2024 at 10:48 AM Vladimir Golubev <vvdr....@gmail.com> wrote:
>
>> Hello All,
>>
>> I recently did some research in the Catalyst Analyzer area to check
>> whether it's possible to make it single-pass instead of fixed-point.
>> Despite the flexibility of the current fixed-point approach (new
>> functionality means a new rule), it has some drawbacks. The dependencies
>> between the rules are not obvious, so it's hard to introduce changes
>> without full knowledge of the rule set. Modifying one rule can change
>> the whole chain of transformations in unobvious ways. Since we can hit
>> the maximum number of iterations, there is no guarantee that the plan
>> will ever be resolved. And from a performance perspective, the Analyzer
>> can be quite slow for large logical plans and wide tables.
>>
>> The idea is to resolve the tree in a single post-order traversal. This
>> approach can be beneficial for the following reasons:
>>
>> - Code can rely on child nodes being resolved
>> - Name visibility can be controlled by a scope, which would be
>>   maintained during the traversal
>> - Random in-tree lookups are disallowed during the traversal, so the
>>   resolution algorithm is linear with respect to the parsed tree size
>> - The Analyzer would deterministically produce a resolved logical plan
>>   or a descriptive error
>>
>> The prototype shows that it's possible to do a bottom-up Analyzer
>> implementation and reuse a large chunk of node-processing code from the
>> existing rule bodies. I ran performance benchmarks on the two cases
>> where the current Analyzer struggles the most - wide nested views and
>> large logical plans - and got 7-10x speedups.
>>
>> Is this something that would be interesting for the Spark community?
>> What would be the best way to proceed with this idea?
>>
>> Regards,
>>
>> Vladimir Golubev.
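To illustrate the bottom-up approach described in the quoted message, here is
a minimal sketch of a single-pass resolver covering roughly the Milestone-0
operators (Project and Filter). The `Scope` class and the resolver structure
are illustrative assumptions, not the prototype's actual code; only the
imported Catalyst types are real Spark APIs:

```scala
import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression, NamedExpression}
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan, Project}

// Name visibility is controlled by a scope built from the resolved child's
// output, so no random in-tree lookups are needed.
final class Scope(attributes: Seq[Attribute]) {
  def resolve(name: String): Attribute =
    // Real code would honor spark.sql.caseSensitive and raise a descriptive
    // AnalysisException; a plain exception keeps the sketch short.
    attributes.find(_.name == name).getOrElse(
      throw new IllegalStateException(s"Cannot resolve `$name`"))
}

object SinglePassResolver {
  def resolve(plan: LogicalPlan): LogicalPlan = plan match {
    case Project(projectList, child) =>
      val resolvedChild = resolve(child) // post-order: children come first
      val scope = new Scope(resolvedChild.output)
      val resolvedList =
        projectList.map(e => resolveExpression(e, scope).asInstanceOf[NamedExpression])
      Project(resolvedList, resolvedChild)
    case Filter(condition, child) =>
      val resolvedChild = resolve(child)
      val scope = new Scope(resolvedChild.output)
      Filter(resolveExpression(condition, scope), resolvedChild)
    case leaf if leaf.children.isEmpty =>
      leaf // e.g. an already-resolved relation
    case other =>
      // Fail deterministically with a descriptive error instead of leaving
      // the plan partially resolved after a bounded number of iterations.
      throw new IllegalStateException(s"Operator not yet supported: ${other.nodeName}")
  }

  private def resolveExpression(expr: Expression, scope: Scope): Expression =
    expr match {
      case u: UnresolvedAttribute => scope.resolve(u.name)
      case e => e.mapChildren(resolveExpression(_, scope))
    }
}
```

Because each node sees a scope derived from its already-resolved children,
resolution is a single linear walk over the parsed tree, and a harness like
the dual-run sketch above could compare this resolver's output against the
fixed-point analyzer on every test query.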