+1. The analyzer rule order issue has bitten me multiple times and it's very hard to make your analyzer rule bug-free if it interacts with other rules.
On Wed, Aug 21, 2024 at 2:49 AM Reynold Xin <r...@databricks.com.invalid> wrote:

> +1 on this too.
>
> When I implemented "group by all", I introduced at least two subtle bugs
> that many reviewers weren't able to catch, and those two bugs would not
> have been possible to introduce if we had a single-pass analyzer. Single
> pass can make the whole framework more robust.
>
> On Tue, Aug 20, 2024 at 7:23 PM Xiao Li <gatorsm...@gmail.com> wrote:
>
>> This sounds like a good idea!
>>
>> The Analyzer is complex, and the changes in the new Analyzer should not
>> affect the existing one. Users could add QO rules and rely on the
>> existing structures and patterns of the logical plan trees generated by
>> the current one. The new Analyzer needs to generate the same logical
>> plan trees as the current one.
>>
>> Cheers,
>>
>> Xiao
>>
>> On Wed, Aug 14, 2024 at 11:27 Vladimir Golubev <vvdr....@gmail.com> wrote:
>>
>>> - I think we can rely on the current tests. One possibility would be
>>> to dual-run both Analyzer implementations if `Utils.isTesting` and
>>> compare the (normalized) logical plans.
>>> - We can implement the analyzer functionality by milestones (Milestone
>>> 0: Project, Filter, UnresolvedInlineTable; Milestone 2: support main
>>> datasources; ...). Running both analyzers in mixed mode may lead to
>>> unexpected logical plan problems, because that would introduce a
>>> completely different chain of transformations.
>>>
>>> On Wed, Aug 14, 2024 at 3:58 PM Herman van Hovell <her...@databricks.com> wrote:
>>>
>>>> +1(000) on this!
>>>>
>>>> This should massively reduce allocations done in the analyzer, and it
>>>> is much more efficient. I also can't count the times that I had to
>>>> increase the number of iterations. This sounds like a no-brainer to me.
>>>>
>>>> I do have two questions:
>>>>
>>>> - How do we ensure that we don't accidentally break the analyzer?
>>>> Are existing tests enough?
>>>> - How can we introduce this incrementally?
>>>> Can we run the analyzer in mixed mode (both single-pass rules and
>>>> the existing tree-traversal rules) for a while?
>>>>
>>>> Cheers,
>>>> Herman
>>>>
>>>> On Fri, Aug 9, 2024 at 10:48 AM Vladimir Golubev <vvdr....@gmail.com> wrote:
>>>>
>>>>> Hello All,
>>>>>
>>>>> I recently did some research in the Catalyst Analyzer area to check
>>>>> whether it's possible to make it single-pass instead of fixed-point.
>>>>> Despite the flexibility of the current fixed-point approach (new
>>>>> functionality = new rule), it has some drawbacks. The dependencies
>>>>> between the rules are unobvious, so it's hard to introduce changes
>>>>> without full knowledge of the rule set. By modifying one rule, the
>>>>> whole chain of transformations can change in an unobvious way. Since
>>>>> we can hit the maximum number of iterations, there's no guarantee
>>>>> that the plan is going to be resolved. And from a performance
>>>>> perspective, the Analyzer can be quite slow for large logical plans
>>>>> and wide tables.
>>>>>
>>>>> The idea is to resolve the tree in a single post-order traversal.
>>>>> This approach can be beneficial for the following reasons:
>>>>>
>>>>> - Code can rely on child nodes being resolved.
>>>>> - Name visibility can be controlled by a scope, which would be
>>>>> maintained during the traversal.
>>>>> - Random in-tree lookups are disallowed during the traversal, so the
>>>>> resolution algorithm would be linear with respect to the parsed
>>>>> tree size.
>>>>> - The Analyzer would deterministically produce a resolved logical
>>>>> plan or a descriptive error.
>>>>>
>>>>> The prototype shows that it's possible to do a bottom-up Analyzer
>>>>> implementation and reuse a large chunk of node-processing code from
>>>>> rule bodies.
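The single-pass, scope-based resolution described above can be sketched in miniature. Everything below is an illustrative toy (the node classes and function names are stand-ins, not Catalyst's actual API): children are resolved first, a name scope flows up the tree, and an unresolvable name fails immediately with a descriptive error instead of surviving N fixed-point iterations.

```python
# Toy sketch of single-pass, bottom-up (post-order) name resolution.
# All classes here are illustrative stand-ins, not Catalyst classes.
from dataclasses import dataclass

@dataclass
class UnresolvedAttr:
    name: str

@dataclass
class Attr:          # resolved attribute with a stable id
    name: str
    attr_id: int

@dataclass
class Relation:      # leaf node producing columns
    columns: list

@dataclass
class Project:
    exprs: list
    child: object

@dataclass
class Filter:
    condition: object
    child: object

def resolve(node):
    """Post-order traversal: resolve children first, then the node itself,
    using only the name scope handed up from the child (no random in-tree
    lookups), so the whole pass is linear in the tree size."""
    if isinstance(node, Relation):
        # Leaf: its output columns become the scope for the parent.
        return node, {a.name: a for a in node.columns}
    if isinstance(node, Filter):
        child, scope = resolve(node.child)
        cond = resolve_expr(node.condition, scope)
        return Filter(cond, child), scope  # Filter keeps the child's scope
    if isinstance(node, Project):
        child, scope = resolve(node.child)
        exprs = [resolve_expr(e, scope) for e in node.exprs]
        # Project replaces the scope with its own output list.
        return Project(exprs, child), {a.name: a for a in exprs}
    raise NotImplementedError(type(node))

def resolve_expr(expr, scope):
    if isinstance(expr, UnresolvedAttr):
        if expr.name not in scope:
            # Deterministic, descriptive error instead of an
            # unresolved plan after hitting the iteration limit.
            raise ValueError(f"UNRESOLVED_COLUMN: `{expr.name}`")
        return scope[expr.name]
    return expr

t = Relation([Attr("a", 1), Attr("b", 2)])
plan = Project([UnresolvedAttr("a")], Filter(UnresolvedAttr("b"), t))
resolved, out_scope = resolve(plan)
```

Because each node is visited exactly once and sees an already-resolved child plus its scope, rule-ordering interactions of the kind mentioned at the top of the thread cannot arise in this traversal.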
>>>>> I did the performance benchmark on two cases where the current
>>>>> Analyzer struggles the most (wide nested views and large logical
>>>>> plans) and got 7-10x performance speedups.
>>>>>
>>>>> Is this something that would be interesting for the Spark community?
>>>>> What would be the best way to proceed with this idea?
>>>>>
>>>>> Regards,
>>>>>
>>>>> Vladimir Golubev.
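The dual-run testing idea suggested earlier in the thread (run both analyzer implementations and compare the normalized logical plans) can be sketched as follows. The plan encoding, the normalization rule, and all names are illustrative assumptions, not Spark internals; the point is only that incidental differences such as auto-generated attribute ids must be normalized away before comparing.

```python
# Illustrative sketch of the dual-run idea: analyze the same parsed plan
# with both implementations and compare after normalization. Plans are
# modeled as nested tuples; normalize() renumbers attribute ids in
# first-seen order so structurally equal plans compare equal even when
# their auto-generated ids differ.

def normalize(plan, mapping=None):
    if mapping is None:
        mapping = {}
    if isinstance(plan, tuple) and plan and plan[0] == "attr":
        _, name, attr_id = plan
        # Replace the global id with a plan-local, deterministic one.
        norm_id = mapping.setdefault(attr_id, len(mapping))
        return ("attr", name, norm_id)
    if isinstance(plan, tuple):
        return tuple(normalize(p, mapping) for p in plan)
    return plan

def dual_run(parsed_plan, fixed_point_analyzer, single_pass_analyzer):
    """Run both analyzers (e.g. only when Utils.isTesting in Spark's case)
    and fail loudly if the normalized results diverge."""
    expected = fixed_point_analyzer(parsed_plan)
    actual = single_pass_analyzer(parsed_plan)
    if normalize(expected) != normalize(actual):
        raise AssertionError("single-pass plan differs from fixed-point plan")
    return expected

# Two analyzers producing structurally equal plans with different ids:
fixed_point = lambda p: ("project", ("attr", "a", 7), ("relation", "t"))
single_pass = lambda p: ("project", ("attr", "a", 42), ("relation", "t"))
result = dual_run(("parsed",), fixed_point, single_pass)
```

A real normalization would have to cover more than ids (aliases, rule-introduced intermediate projections, and so on), which is exactly why the thread notes that the new Analyzer should produce the same logical plan trees as the current one.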