- I think we can rely on the current tests. One possibility would be to
  dual-run both Analyzer implementations if `Utils.isTesting` and compare the
  (normalized) logical plans; see the sketch after this list.
- We can implement the analyzer functionality by milestones (Milestone 0:
  Project, Filter, UnresolvedInlineTable; Milestone 1: support the main data
  sources; ...). Running both analyzers in mixed mode may lead to unexpected
  logical plan problems, because that would introduce a completely different
  chain of transformations.
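To make the dual-run idea concrete, here is a minimal sketch of the
testing-mode cross-check. `resolveSinglePass`, `resolveFixedPoint` and
`normalizePlan` are hypothetical placeholders, not existing Spark APIs; only
`Utils.isTesting` and `LogicalPlan` are real:

```scala
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.util.Utils

object DualRunAnalyzer {
  def execute(plan: LogicalPlan): LogicalPlan = {
    val singlePassResult = resolveSinglePass(plan)
    if (Utils.isTesting) {
      // Cross-check against the legacy fixed-point analyzer. Normalization
      // would canonicalize incidental differences (e.g. generated expression
      // IDs) before comparison.
      val fixedPointResult = resolveFixedPoint(plan)
      assert(
        normalizePlan(singlePassResult) == normalizePlan(fixedPointResult),
        s"Analyzer implementations diverged:\n$singlePassResult\n$fixedPointResult")
    }
    singlePassResult
  }

  // Placeholders standing in for the two analyzer implementations and the
  // plan-normalization step discussed above.
  private def resolveSinglePass(plan: LogicalPlan): LogicalPlan = ???
  private def resolveFixedPoint(plan: LogicalPlan): LogicalPlan = ???
  private def normalizePlan(plan: LogicalPlan): LogicalPlan = ???
}
```

Normalization matters here because the two implementations may legitimately
differ in incidental details such as expression IDs, and the comparison should
only fail on semantic divergence.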
On Wed, Aug 14, 2024 at 3:58 PM Herman van Hovell <her...@databricks.com> wrote:

> +1(000) on this!
>
> This should massively reduce the allocations done in the analyzer, and it
> is much more efficient. I also can't count the number of times I have had
> to increase the number of iterations. This sounds like a no-brainer to me.
>
> I do have two questions:
>
> - How do we ensure that we don't accidentally break the analyzer? Are the
>   existing tests enough?
> - How can we introduce this incrementally? Can we run the analyzer in
>   mixed mode (both single-pass rules and the existing tree-traversal
>   rules) for a while?
>
> Cheers,
> Herman
>
> On Fri, Aug 9, 2024 at 10:48 AM Vladimir Golubev <vvdr....@gmail.com> wrote:
>
>> Hello All,
>>
>> I recently did some research in the Catalyst Analyzer area to check
>> whether it's possible to make it single-pass instead of fixed-point.
>> Despite the flexibility of the current fixed-point approach (new
>> functionality means a new rule), it has some drawbacks. The dependencies
>> between the rules are not obvious, so it's hard to introduce changes
>> without full knowledge of the rule set. Modifying one rule can change
>> the whole chain of transformations in unobvious ways. Since we can hit
>> the maximum number of iterations, there is no guarantee that the plan
>> will ever be resolved. And from a performance perspective, the Analyzer
>> can be quite slow for large logical plans and wide tables.
>>
>> The idea is to resolve the tree in a single post-order traversal. This
>> approach can be beneficial for the following reasons:
>>
>> - Code can rely on child nodes being resolved
>> - Name visibility can be controlled by a scope, which would be
>>   maintained during the traversal
>> - Random in-tree lookups are disallowed during the traversal, so the
>>   resolution algorithm is linear with respect to the parsed tree size
>> - The Analyzer would deterministically produce a resolved logical plan
>>   or a descriptive error
>>
>> The prototype shows that it's possible to do a bottom-up Analyzer
>> implementation and reuse a large chunk of node-processing code from the
>> existing rule bodies. I ran performance benchmarks on the two cases
>> where the current Analyzer struggles the most - wide nested views and
>> large logical plans - and got 7-10x speedups.
>>
>> Is this something that would be interesting for the Spark community?
>> What would be the best way to proceed with this idea?
>>
>> Regards,
>>
>> Vladimir Golubev.
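To illustrate the bottom-up approach described in the quoted message, here is
a minimal sketch of a single-pass resolver covering roughly the Milestone-0
operators (Project and Filter). The `Scope` class and the resolver structure
are illustrative assumptions, not the prototype's actual code; only the
imported Catalyst types are real Spark APIs:

```scala
import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression, NamedExpression}
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan, Project}

// Name visibility is controlled by a scope built from the resolved child's
// output, so no random in-tree lookups are needed.
final class Scope(attributes: Seq[Attribute]) {
  def resolve(name: String): Attribute =
    // Real code would honor spark.sql.caseSensitive and raise a descriptive
    // AnalysisException; a plain exception keeps the sketch short.
    attributes.find(_.name == name).getOrElse(
      throw new IllegalStateException(s"Cannot resolve `$name`"))
}

object SinglePassResolver {
  def resolve(plan: LogicalPlan): LogicalPlan = plan match {
    case Project(projectList, child) =>
      val resolvedChild = resolve(child) // post-order: children come first
      val scope = new Scope(resolvedChild.output)
      val resolvedList =
        projectList.map(e => resolveExpression(e, scope).asInstanceOf[NamedExpression])
      Project(resolvedList, resolvedChild)
    case Filter(condition, child) =>
      val resolvedChild = resolve(child)
      val scope = new Scope(resolvedChild.output)
      Filter(resolveExpression(condition, scope), resolvedChild)
    case leaf if leaf.children.isEmpty =>
      leaf // e.g. an already-resolved relation
    case other =>
      // Fail deterministically with a descriptive error instead of leaving
      // the plan partially resolved after a bounded number of iterations.
      throw new IllegalStateException(s"Operator not yet supported: ${other.nodeName}")
  }

  private def resolveExpression(expr: Expression, scope: Scope): Expression =
    expr match {
      case u: UnresolvedAttribute => scope.resolve(u.name)
      case e => e.mapChildren(resolveExpression(_, scope))
    }
}
```

Because each node sees a scope derived from its already-resolved children,
resolution is a single linear walk over the parsed tree, and a harness like
the dual-run sketch above could compare this resolver's output against the
fixed-point analyzer on every test query.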