Re: [VOTE] Single-pass Analyzer for Catalyst

Mich Talebzadeh Thu, 03 Oct 2024 10:41:31 -0700

With my project manager hat on, and having read the SPIP
<https://issues.apache.org/jira/browse/SPARK-49834>, here are my thoughts.


The proposed single-pass Analyzer framework could offer significant
long-term benefits in efficiency, maintenance, and stability, especially
for large or complex queries. However, the rewrite involves substantial
challenges, including the complexity of the transition, the resource cost,
and the risk of breaking edge cases or existing workflows during the
migration period. The key trade-off is between the upfront complexity and
development time and the potential long-term gains in performance,
predictability, and ease of maintenance. Phasing the implementation could
be an effective way to balance the risks and rewards, since it lets the
community move gradually to the new framework while limiting disruption.
Here is a phased plan that we could consider:

Phase 1: Experimental Opt-In
   - Introduce the single-pass Analyzer framework as an experimental
feature.
   - Allow users to opt in through a configuration setting, say
`spark.sql.analyzer.singlePass.enabled=true`, so that developers can
start testing their workflows with it.
   - Maintain the existing fixed-point Analyzer as the default to ensure
stability for current users.
   - Gradually build out coverage for common SQL and DataFrame operations.
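
As a rough illustration of the Phase 1 opt-in, the gating could look
something like the sketch below. It is plain Python rather than Spark
code: the flag name is the one suggested above and is not an existing
Spark configuration, and both analyzer functions are hypothetical
stand-ins.

```python
# Hypothetical flag name taken from the suggestion above; this is an
# assumption, not an existing Spark configuration key.
SINGLE_PASS_FLAG = "spark.sql.analyzer.singlePass.enabled"


def fixed_point_analyze(plan: str) -> str:
    """Stand-in for the existing fixed-point Analyzer (the default)."""
    return f"fixed-point({plan})"


def single_pass_analyze(plan: str) -> str:
    """Stand-in for the experimental single-pass Analyzer."""
    return f"single-pass({plan})"


def analyze(plan: str, conf: dict) -> str:
    """Pick an analyzer from the config; fixed-point unless opted in."""
    enabled = str(conf.get(SINGLE_PASS_FLAG, "false")).lower() == "true"
    analyzer = single_pass_analyze if enabled else fixed_point_analyze
    return analyzer(plan)
```

With an empty config the fixed-point path is used, so current users see
no change unless they set the flag explicitly.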

Phase 2: Expanded Operator Coverage
   - Incrementally support more SQL operators, expressions, and DataFrame
functionality as feedback and testing reveal areas of improvement.
   - Ensure unit and integration tests run against both frameworks to
maintain backward compatibility.
   - Provide detailed documentation and migration guides so users are aware
of the differences and can adjust their code if needed.
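
One possible shape for running the same tests against both frameworks
is a small differential check: analyze each query with both
implementations and report any case where the results differ. The
sketch below is plain Python with toy stand-ins for the two analyzers;
the function names are hypothetical and nothing here is taken from the
Spark test suite.

```python
from typing import Callable, Iterable, List, Tuple

Analyzer = Callable[[str], str]


def find_mismatches(
    queries: Iterable[str],
    fixed_point: Analyzer,
    single_pass: Analyzer,
) -> List[Tuple[str, str, str]]:
    """Return (query, fixed_point_result, single_pass_result) for each
    query where the two analyzers disagree."""
    mismatches = []
    for query in queries:
        expected = fixed_point(query)  # existing behaviour is the reference
        actual = single_pass(query)
        if expected != actual:
            mismatches.append((query, expected, actual))
    return mismatches


def toy_fixed_point(query: str) -> str:
    """Toy stand-in: lowercases the query and collapses whitespace."""
    return " ".join(query.lower().split())


def toy_single_pass(query: str) -> str:
    """Toy stand-in that is meant to produce the same result."""
    return " ".join(query.lower().split())
```

An empty result means the two implementations agree on every query in
the sample; any entry in the list is a candidate edge case to look at
before widening coverage.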

Phase 3: Deprecation of the Fixed-Point Analyzer
   - Once the single-pass Analyzer has full coverage and has been tested
extensively in production environments, deprecate the old fixed-point
framework.
   - Offer a transition period where both frameworks are supported to give
users time to adjust.

Phase 4: Full Transition and Removal of Fixed-Point Framework
   - After sufficient testing and user adoption, make the single-pass
Analyzer the default and eventually remove the old framework in a future
major release (say Spark 5.0).

Benefits of Phasing:
- Risk Mitigation: The phased approach allows gradual adoption, reducing
the risk of breaking existing workloads or workflows. It ensures there is
plenty of time for testing and feedback before a full transition.
- Early User Feedback: Users can test the new framework in their
environments early and provide feedback, allowing developers to address
edge cases before it becomes the default.
- Controlled Rollout: Phasing ensures that any unforeseen issues can be
addressed incrementally without large disruptions to Spark deployments.

This approach will hopefully ensure a smooth transition to the new
framework while balancing the trade-offs of complexity, resource
availability and long-term gains.

HTH

Mich Talebzadeh,

Architect | Data Engineer | Data Science | Financial Crime
PhD <https://en.wikipedia.org/wiki/Doctor_of_Philosophy> Imperial College
London <https://en.wikipedia.org/wiki/Imperial_College_London>
London, United Kingdom


View my LinkedIn profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed. It is essential to note
that, as with any advice, "one test result is worth one thousand
expert opinions" (Wernher von Braun
<https://en.wikipedia.org/wiki/Wernher_von_Braun>).


On Mon, 30 Sept 2024 at 23:38, Reynold Xin <[email protected]>
wrote:

> I don't actually "lead" this. But I don't think this needs to target a
> specific Spark version given it should not have any user facing
> consequences?
>
>
> On Mon, Sep 30, 2024 at 3:36 PM Dongjoon Hyun <[email protected]> wrote:
>
>> Thank you for leading this, Vladimir, Reynold, Herman.
>>
>> I'm wondering if this is really an achievable goal for Apache Spark 4.0.0.
>>
>> If it's expected that we are unable to deliver it, shall we postpone this
>> vote until 4.1.0 planning?
>>
>> Anyway, since SPARK-49834 has a target version 4.0.0 explicitly,
>>
>> -1 from my side.
>>
>> Thanks,
>> Dongjoon.
>>
>>
>> On 2024/09/30 17:51:24 Herman van Hovell wrote:
>> > +1
>> >
>> > On Mon, Sep 30, 2024 at 8:29 AM Reynold Xin <[email protected]>
>> > wrote:
>> >
>> > > +1
>> > >
>> > > On Mon, Sep 30, 2024 at 6:47 AM Vladimir Golubev <[email protected]>
>> > > wrote:
>> > >
>> > >> Hi all,
>> > >>
>> > >> I’d like to start a vote for a single-pass Analyzer for the Catalyst
>> > >> project. This project will introduce a new analysis framework to the
>> > >> Catalyst, which will eventually replace the fixed-point one.
>> > >>
>> > >> Please refer to the SPIP jira:
>> > >> https://issues.apache.org/jira/browse/SPARK-49834
>> > >>
>> > >> [ ] +1: Accept the proposal
>> > >> [ ] +0
>> > >> [ ] -1: I don’t think this is a good idea because …
>> > >>
>> > >> Thanks!
>> > >>
>> > >> Vladimir
>> > >>
>> > >
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: [email protected]
>>
>>
