Hi Mich! Thank you for this input.

Yes, this is exactly the approach I would propose too. Putting the new
analyzer under a flag and making the tests pass for both implementations is
crucial. We need to compare the logical (analyzed) plans and ensure that
they are identical.
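
To make the dual-run check concrete, here is a toy sketch of the idea in Scala. The types and resolution rule below are illustrative stand-ins, not the real Catalyst API: the point is only that both analyzer implementations run over the same unresolved plan and the resulting analyzed plans are compared for structural equality.

```scala
// Toy model of "run both analyzers, compare the analyzed plans".
// All names here are illustrative, not actual Catalyst classes.
sealed trait Plan
case class UnresolvedRelation(name: String) extends Plan
case class ResolvedRelation(name: String, schema: Seq[String]) extends Plan
case class Project(columns: Seq[String], child: Plan) extends Plan

// A fake catalog shared by both analyzers.
val catalog = Map("t" -> Seq("a", "b"))

// Fixed-point style: apply the resolution rule repeatedly until the plan
// stops changing.
def analyzeFixedPoint(plan: Plan): Plan = {
  def applyOnce(p: Plan): Plan = p match {
    case UnresolvedRelation(n) => ResolvedRelation(n, catalog(n))
    case Project(cols, child)  => Project(cols, applyOnce(child))
    case other                 => other
  }
  var current = plan
  var next = applyOnce(current)
  while (next != current) { current = next; next = applyOnce(current) }
  current
}

// Single-pass style: resolve everything in one bottom-up traversal.
def analyzeSinglePass(plan: Plan): Plan = plan match {
  case UnresolvedRelation(n) => ResolvedRelation(n, catalog(n))
  case Project(cols, child)  => Project(cols, analyzeSinglePass(child))
  case resolved              => resolved
}

// The dual-run check: both analyzed plans must be identical.
val query = Project(Seq("a"), UnresolvedRelation("t"))
assert(analyzeFixedPoint(query) == analyzeSinglePass(query))
```

In the actual test suites, this would amount to running each query through both frameworks (with the new one gated behind the proposed flag) and asserting that the analyzed logical plans match, so that any divergence surfaces immediately.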

On Thu, Oct 3, 2024 at 7:41 PM Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

>
> With a project manager hat on and having read the SPIP
> <https://issues.apache.org/jira/browse/SPARK-49834>
>
> This proposed single-pass Analyzer framework does potentially offer
> significant long-term benefits in terms of efficiency, maintenance, and
> stability, especially for large or complex queries. However, the rewrite
> involves substantial challenges, including the complexity of the
> transition, the resource cost, and the risk of breaking edge cases or
> existing workflows during the migration period. The key trade-off is
> between the upfront complexity and development time versus the potential
> long-term gains in performance, predictability, and ease of maintenance.
> Phasing the implementation could be an effective method to balance the
> risks and rewards. It allows the community to gradually transition to the
> new framework while mitigating potential disruptions. Here is a
> proposal that we can consider:
>
> Phase 1: Experimental Opt-In
>    - Introduce the single-pass Analyzer framework as an experimental
> feature.
>    - Allow users to opt in through a configuration setting, say
> `spark.sql.analyzer.singlePass.enabled=true`, so that developers can
> start testing their workflows with it.
>    - Maintain the existing fixed-point Analyzer as the default to ensure
> stability for current users.
>    - Gradually build out coverage for common SQL and DataFrame operations.
>
> Phase 2: Expanded Operator Coverage
>    - Incrementally support more SQL operators, expressions, and DataFrame
> functionality as feedback and testing reveal areas of improvement.
>    - Ensure unit and integration tests run against both frameworks to
> maintain backward compatibility.
>    - Provide detailed documentation and migration guides so users are
> aware of the differences and can adjust their code if needed.
>
> Phase 3: Deprecation of the Fixed-Point Analyzer
>    - Once the single-pass Analyzer has full coverage and has been tested
> extensively in production environments, deprecate the old fixed-point
> framework.
>    - Offer a transition period where both frameworks are supported to give
> users time to adjust.
>
> Phase 4: Full Transition and Removal of Fixed-Point Framework
>    - After sufficient testing and user adoption, make the single-pass
> Analyzer the default and eventually remove the old framework in a future
> major release (say Spark 5.0).
>
> Benefits of Phasing:
> - Risk Mitigation: The phased approach allows gradual adoption, reducing
> the risk of breaking existing workloads or workflows. It ensures there is
> plenty of time for testing and feedback before a full transition.
> - Early User Feedback: Users can test the new framework in their
> environments early and provide feedback, allowing developers to address
> edge cases before it becomes the default.
> - Controlled Rollout: Phasing ensures that any unforeseen issues can be
> addressed incrementally without large disruptions to Spark deployments.
>
> This approach will hopefully ensure a smooth transition to the new
> framework while balancing the trade-offs of complexity, resource
> availability and long-term gains.
>
> HTH
>
> Mich Talebzadeh,
>
> Architect | Data Engineer | Data Science | Financial Crime
> PhD <https://en.wikipedia.org/wiki/Doctor_of_Philosophy> Imperial College
> London <https://en.wikipedia.org/wiki/Imperial_College_London>
> London, United Kingdom
>
>
>    view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed. It is essential to note
> that, as with any advice, "one test result is worth one-thousand
> expert opinions" (Wernher von Braun
> <https://en.wikipedia.org/wiki/Wernher_von_Braun>).
>
>
> On Mon, 30 Sept 2024 at 23:38, Reynold Xin <r...@databricks.com.invalid>
> wrote:
>
>> I don't actually "lead" this. But I don't think this needs to target a
>> specific Spark version given it should not have any user facing
>> consequences?
>>
>>
>> On Mon, Sep 30, 2024 at 3:36 PM Dongjoon Hyun <dongj...@apache.org>
>> wrote:
>>
>>> Thank you for leading this, Vladimir, Reynold, Herman.
>>>
>>> I'm wondering if this is really achievable goal for Apache Spark 4.0.0.
>>>
>>> If it's expected that we are unable to deliver it, shall we postpone
>>> this vote until 4.1.0 planning?
>>>
>>> Anyway, since SPARK-49834 has a target version 4.0.0 explicitly,
>>>
>>> -1 from my side.
>>>
>>> Thanks,
>>> Dongjoon.
>>>
>>>
>>> On 2024/09/30 17:51:24 Herman van Hovell wrote:
>>> > +1
>>> >
>>> > On Mon, Sep 30, 2024 at 8:29 AM Reynold Xin
>>> <r...@databricks.com.invalid>
>>> > wrote:
>>> >
>>> > > +1
>>> > >
>>> > > On Mon, Sep 30, 2024 at 6:47 AM Vladimir Golubev <vvdr....@gmail.com
>>> >
>>> > > wrote:
>>> > >
>>> > >> Hi all,
>>> > >>
>>> > >> I’d like to start a vote for a single-pass Analyzer for the Catalyst
>>> > >> project. This project will introduce a new analysis framework to the
>>> > >> Catalyst, which will eventually replace the fixed-point one.
>>> > >>
>>> > >> Please refer to the SPIP jira:
>>> > >> https://issues.apache.org/jira/browse/SPARK-49834
>>> > >>
>>> > >> [ ] +1: Accept the proposal
>>> > >> [ ] +0
>>> > >> [ ] -1: I don’t think this is a good idea because …
>>> > >>
>>> > >> Thanks!
>>> > >>
>>> > >> Vladimir
>>> > >>
>>> > >
>>> >
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
