This sounds like a good idea! The Analyzer is complex, and changes in the new Analyzer should not affect the existing one. Users could then add QO rules and rely on the existing structure and patterns of the logical plan trees generated by the current one. The new Analyzer needs to generate the same logical plan trees as the current one.
Cheers,
Xiao

Vladimir Golubev <vvdr....@gmail.com> wrote on Wed, Aug 14, 2024 at 11:27:

> - I think we can rely on the current tests. One possibility would be to
> dual-run both Analyzer implementations if `Utils.isTesting` and compare
> the (normalized) logical plans.
> - We can implement the analyzer functionality by milestones (Milestone 0:
> Project, Filter, UnresolvedInlineTable; Milestone 2: support the main
> datasources; ...). Running both analyzers in mixed mode may lead to
> unexpected logical plan problems, because that would introduce a
> completely different chain of transformations.
>
> On Wed, Aug 14, 2024 at 3:58 PM Herman van Hovell <her...@databricks.com>
> wrote:
>
>> +1(000) on this!
>>
>> This should massively reduce the allocations done in the analyzer, and
>> it is much more efficient. I also can't count the times that I had to
>> increase the number of iterations. This sounds like a no-brainer to me.
>>
>> I do have two questions:
>>
>> - How do we ensure that we don't accidentally break the analyzer? Are
>> the existing tests enough?
>> - How can we introduce this incrementally? Can we run the analyzer in
>> mixed mode (both single-pass rules and the existing tree-traversal
>> rules) for a while?
>>
>> Cheers,
>> Herman
>>
>> On Fri, Aug 9, 2024 at 10:48 AM Vladimir Golubev <vvdr....@gmail.com>
>> wrote:
>>
>>> Hello All,
>>>
>>> I recently did some research in the Catalyst Analyzer area to check
>>> whether it is possible to make it single-pass instead of fixed-point.
>>> Despite the flexibility of the current fixed-point approach (new
>>> functionality = a new rule), it has some drawbacks. The dependencies
>>> between the rules are not obvious, so it is hard to introduce changes
>>> without full knowledge of the rule set. By modifying one rule, the
>>> whole chain of transformations can change in an unobvious way. Since
>>> we can hit the maximum number of iterations, there is no guarantee
>>> that the plan is going to be resolved. And from a performance
>>> perspective, the Analyzer can be quite slow for large logical plans
>>> and wide tables.
>>>
>>> The idea is to resolve the tree in a single post-order traversal. This
>>> approach can be beneficial for the following reasons:
>>>
>>> - Code can rely on child nodes being resolved
>>> - Name visibility can be controlled by a scope, which would be
>>> maintained during the traversal
>>> - Random in-tree lookups are disallowed during the traversal, so the
>>> resolution algorithm would be linear with respect to the parsed tree
>>> size
>>> - The Analyzer would deterministically produce a resolved logical plan
>>> or a descriptive error
>>>
>>> The prototype shows that it is possible to do a bottom-up Analyzer
>>> implementation and reuse a large chunk of node-processing code from
>>> the rule bodies. I ran a performance benchmark on the two cases where
>>> the current Analyzer struggles the most - wide nested views and large
>>> logical plans - and got 7-10x speedups.
>>>
>>> Is this something that would be interesting for the Spark community?
>>> What would be the best way to proceed with this idea?
>>>
>>> Regards,
>>>
>>> Vladimir Golubev
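The single-pass, post-order resolution described in the thread (children resolved first, name visibility tracked by a scope, no random in-tree lookups) can be sketched on a toy plan type. Everything here (`ToyPlan`, `Resolved`, `SinglePassResolver`, the node shapes) is invented for illustration and is far simpler than the real Catalyst Analyzer:

```scala
// Toy logical plan nodes; real Catalyst nodes carry expressions, not strings.
sealed trait ToyPlan
final case class Relation(name: String, output: Seq[String]) extends ToyPlan
final case class Project(columns: Seq[String], child: ToyPlan) extends ToyPlan
final case class Filter(column: String, child: ToyPlan) extends ToyPlan

// A resolved subtree plus the names it makes visible to its parent
// (the "scope" maintained during the traversal).
final case class Resolved(plan: ToyPlan, scope: Set[String])

object SinglePassResolver {
  // Post-order traversal: each node resolves its child first, then checks
  // its own names against the child's scope. One pass, linear in tree size,
  // and the result is deterministically a plan or a descriptive error.
  def resolve(plan: ToyPlan): Either[String, Resolved] = plan match {
    case r @ Relation(_, output) =>
      Right(Resolved(r, output.toSet))
    case Project(columns, child) =>
      resolve(child).flatMap { c =>
        val missing = columns.filterNot(c.scope)
        if (missing.nonEmpty)
          Left(s"Unresolved attributes: ${missing.mkString(", ")}")
        // A Project narrows the scope to the projected names.
        else Right(Resolved(Project(columns, c.plan), columns.toSet))
      }
    case Filter(column, child) =>
      resolve(child).flatMap { c =>
        if (!c.scope(column))
          Left(s"Unresolved attribute in filter: $column")
        // A Filter passes its child's scope through unchanged.
        else Right(Resolved(Filter(column, c.plan), c.scope))
      }
  }
}
```

For example, resolving `Project(Seq("a"), Filter("b", Relation("t", Seq("a", "b"))))` succeeds with scope `Set("a")`, while projecting a name the relation does not output yields a `Left` with the unresolved attribute, rather than an unresolved plan escaping after a capped number of iterations.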
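The dual-run testing idea from the thread (run both implementations when `Utils.isTesting` holds and compare normalized logical plans) can be sketched as below. `DualRunCheck`, `legacyAnalyze`, `singlePassAnalyze`, and the string-based "plans" are all stand-ins for illustration; only the `Utils.isTesting` flag is mentioned in the thread:

```scala
object DualRunCheck {
  // Stand-in for Utils.isTesting.
  val isTesting: Boolean = true

  // Normalization strips run-to-run noise before comparison; here it just
  // collapses case and whitespace in a toy plan string. The real comparison
  // would normalize things like expression ids in the logical plan.
  def normalize(plan: String): String =
    plan.toLowerCase.replaceAll("\\s+", " ").trim

  // Run the new analyzer; in testing mode also run the legacy one and
  // require that the normalized results agree, so any divergence fails fast.
  def analyze(query: String,
              legacyAnalyze: String => String,
              singlePassAnalyze: String => String): String = {
    val result = singlePassAnalyze(query)
    if (isTesting) {
      val expected = legacyAnalyze(query)
      require(normalize(result) == normalize(expected),
        s"Single-pass analyzer diverged on '$query': $result vs $expected")
    }
    result
  }
}
```

This shape lets the existing test suite exercise both analyzers at once without changing any test expectations: agreement is checked as a side effect, and the single-pass result is what callers see.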