+1(000) on this! This should massively reduce allocations done in the analyzer, and it is much more efficient. I also can't count the times that I had to increase the number of iterations. This sounds like a no-brainer to me.
I do have two questions: - How do we ensure that we don't accidentally break the analyzer? Are existing tests enough? - How can we introduce this incrementally? Can we run the analyzer in mixed mode (both single pass rules and the existing tree traversal rules) for a while? Cheers, Herman On Fri, Aug 9, 2024 at 10:48 AM Vladimir Golubev <vvdr....@gmail.com> wrote: > Hello All, > > I recently did some research in the Catalyst Analyzer area to check if > it’s possible to make it single-pass instead of fixed-point. Despite the > flexibility of the current fixed-point approach (new functionality - new > rule), it has some drawbacks. The dependencies between the rules are > unobvious, so it’s hard to introduce changes without having the full > knowledge. By modifying one rule, the whole chain of transformations can > change in an unobvious way. Since we can hit the maximum number of > iterations, there’s no guarantee that the plan is going to be resolved. And > from a performance perspective the Analyzer can be quite slow for large > logical plans and wide tables. > > The idea is to resolve the tree in a single-pass post-order traversal. > This approach can be beneficial for the following reasons: > > - > > Code can rely on child nodes being resolved > - > > Name visibility can be controlled by a scope, which would be > maintained during the traversal > - > > Random in-tree lookups are disallowed during the traversal, so the > resolution algorithm would be linear with respect to the parsed tree size > - > > Analyzer would deterministically produce resolved logical plan or a > descriptive error > > > The prototype shows that it’s possible to do a bottom-up Analyzer > implementation and reuse a large chunk of node-processing code from rule > bodies. I did the performance benchmark on two cases where the current > Analyzer struggles the most - wide nested views and large logical plans and > got 7-10x performance speedups. > > Is this something that would be interesting for the Spark community? What > would be the best way to proceed with this idea? > > Regards, > > Vladimir Golubev. >