Re: [Spark SQL] A single-pass resolution approach for the Catalyst Analyzer

Herman van Hovell Wed, 14 Aug 2024 06:59:20 -0700

+1(000) on this!

This should massively reduce allocations done in the analyzer, and it is
much more efficient. I also can't count the times that I had to increase
the number of iterations. This sounds like a no-brainer to me.


I do have two questions:

   - How do we ensure that we don't accidentally break the analyzer? Are
   existing tests enough?
   - How can we introduce this incrementally? Can we run the analyzer in
   mixed mode (both single pass rules and the existing tree traversal rules)
   for a while?

Cheers,
Herman

On Fri, Aug 9, 2024 at 10:48 AM Vladimir Golubev <[email protected]> wrote:

> Hello All,
>
> I recently did some research in the Catalyst Analyzer area to check if
> it’s possible to make it single-pass instead of fixed-point. Despite the
> flexibility of the current fixed-point approach (new functionality - new
> rule), it has some drawbacks. The dependencies between the rules are
> unobvious, so it’s hard to introduce changes without having the full
> knowledge. By modifying one rule, the whole chain of transformations can
> change in an unobvious way. Since we can hit the maximum number of
> iterations, there’s no guarantee that the plan is going to be resolved. And
> from a performance perspective the Analyzer can be quite slow for large
> logical plans and wide tables.
>
> The idea is to resolve the tree in a single-pass post-order traversal.
> This approach can be beneficial for the following reasons:
>
>    -
>
>    Code can rely on child nodes being resolved
>    -
>
>    Name visibility can be controlled by a scope, which would be
>    maintained during the traversal
>    -
>
>    Random in-tree lookups are disallowed during the traversal, so the
>    resolution algorithm would be linear with respect to the parsed tree size
>    -
>
>    Analyzer would deterministically produce resolved logical plan or a
>    descriptive error
>
>
> The prototype shows that it’s possible to do a bottom-up Analyzer
> implementation and reuse a large chunk of node-processing code from rule
> bodies. I did the performance benchmark on two cases where the current
> Analyzer struggles the most - wide nested views and large logical plans and
> got 7-10x performance speedups.
>
> Is this something that would be interesting for the Spark community? What
> would be the best way to proceed with this idea?
>
> Regards,
>
> Vladimir Golubev.
>

Re: [Spark SQL] A single-pass resolution approach for the Catalyst Analyzer

Reply via email to