I have read the SPIP <https://issues.apache.org/jira/browse/SPARK-49834> and am commenting here with a project manager hat on.

The proposed single-pass Analyzer framework could offer significant long-term benefits in efficiency, maintainability, and stability, especially for large or complex queries. However, the rewrite poses substantial challenges: the complexity of the transition, the resource cost, and the risk of breaking edge cases or existing workflows during the migration period. The key trade-off is between the upfront complexity and development time on one side and the potential long-term gains in performance, predictability, and ease of maintenance on the other.

Phasing the implementation could be an effective way to balance the risks and rewards. It lets the community move to the new framework gradually while limiting potential disruption. Here is a phased plan that we could consider.

Phase 1: Experimental Opt-In
- Introduce the single-pass Analyzer framework as an experimental feature.
- Allow users to opt in through a configuration setting, say spark.sql.analyzer.singlePass.enabled=true, so that developers can start testing their workflows with it.
- Keep the existing fixed-point Analyzer as the default to ensure stability for current users.
- Gradually build out coverage for common SQL and DataFrame operations.

Phase 2: Expanded Operator Coverage
- Incrementally support more SQL operators, expressions, and DataFrame functionality as feedback and testing reveal areas for improvement.
- Ensure unit and integration tests run against both frameworks to maintain backward compatibility.
- Provide detailed documentation and migration guides so that users are aware of the differences and can adjust their code if needed.

Phase 3: Deprecation of the Fixed-Point Analyzer
- Once the single-pass Analyzer has full coverage and has been tested extensively in production environments, deprecate the old fixed-point framework.
- Offer a transition period during which both frameworks are supported, to give users time to adjust.
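
To make the Phase 1 opt-in and the Phase 2 point about running tests against both frameworks a little more concrete, here is a minimal Python sketch of a differential check. It is only an illustration under stated assumptions: the flag name is the one suggested above and is not a confirmed Spark setting, and run_query is a hypothetical callable that would, in a real setup, execute the query in a Spark session created with the given configuration and return the collected rows.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical flag name taken from the proposal above; not a confirmed Spark setting.
SINGLE_PASS_FLAG = "spark.sql.analyzer.singlePass.enabled"

Rows = List[Tuple]
# Assumed signature: (query text, session configuration) -> collected result rows.
QueryRunner = Callable[[str, Dict[str, str]], Rows]


def compare_analyzers(query: str, run_query: QueryRunner) -> bool:
    """Run `query` with the flag off and on; return True if the rows match."""
    fixed_point_rows = run_query(query, {SINGLE_PASS_FLAG: "false"})
    single_pass_rows = run_query(query, {SINGLE_PASS_FLAG: "true"})
    # Sort by repr so that row order (and None values) do not cause false mismatches.
    return sorted(fixed_point_rows, key=repr) == sorted(single_pass_rows, key=repr)


def find_mismatches(queries: List[str], run_query: QueryRunner) -> List[str]:
    """Return the queries whose results differ between the two analyzers."""
    return [q for q in queries if not compare_analyzers(q, run_query)]
```

A test suite could feed its existing queries through find_mismatches and report any that behave differently once the flag is switched on. Comparing result rows is a deliberate simplification; comparing analyzed plans or error messages would need a different check.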

Phase 4: Full Transition and Removal of Fixed-Point Framework
- After sufficient testing and user adoption, make the single-pass Analyzer the default and eventually remove the old framework in a future major release (say Spark 5.0).

Benefits of Phasing:
- Risk Mitigation: The phased approach allows gradual adoption, reducing the risk of breaking existing workloads or workflows. It leaves plenty of time for testing and feedback before a full transition.
- Early User Feedback: Users can test the new framework in their own environments early and provide feedback, allowing developers to address edge cases before it becomes the default.
- Controlled Rollout: Phasing means that unforeseen issues can be addressed incrementally, without large disruptions to Spark deployments.

This approach will hopefully ensure a smooth transition to the new framework while balancing the trade-offs of complexity, resource availability, and long-term gains.

HTH

Mich Talebzadeh,
Architect | Data Engineer | Data Science | Financial Crime
PhD <https://en.wikipedia.org/wiki/Doctor_of_Philosophy> Imperial College London <https://en.wikipedia.org/wiki/Imperial_College_London>
London, United Kingdom

LinkedIn profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

https://en.everybodywiki.com/Mich_Talebzadeh

Disclaimer: The information provided is correct to the best of my knowledge but of course cannot be guaranteed. It is essential to note that, as with any advice, quote "one test result is worth one-thousand expert opinions (Wernher von Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".


On Mon, 30 Sept 2024 at 23:38, Reynold Xin <r...@databricks.com.invalid> wrote:

> I don't actually "lead" this. But I don't think this needs to target a
> specific Spark version given it should not have any user facing
> consequences?
>
>
> On Mon, Sep 30, 2024 at 3:36 PM Dongjoon Hyun <dongj...@apache.org> wrote:
>
>> Thank you for leading this, Vladimir, Reynold, Herman.
>>
>> I'm wondering if this is really achievable goal for Apache Spark 4.0.0.
>>
>> If it's expected that we are unable to deliver it, shall we postpone this
>> vote until 4.1.0 planning?
>>
>> Anyway, since SPARK-49834 has a target version 4.0.0 explicitly,
>>
>> -1 from my side.
>>
>> Thanks,
>> Dongjoon.
>>
>>
>> On 2024/09/30 17:51:24 Herman van Hovell wrote:
>> > +1
>> >
>> > On Mon, Sep 30, 2024 at 8:29 AM Reynold Xin <r...@databricks.com.invalid>
>> > wrote:
>> >
>> > > +1
>> > >
>> > > On Mon, Sep 30, 2024 at 6:47 AM Vladimir Golubev <vvdr....@gmail.com>
>> > > wrote:
>> > >
>> > >> Hi all,
>> > >>
>> > >> I’d like to start a vote for a single-pass Analyzer for the Catalyst
>> > >> project. This project will introduce a new analysis framework to the
>> > >> Catalyst, which will eventually replace the fixed-point one.
>> > >>
>> > >> Please refer to the SPIP jira:
>> > >> https://issues.apache.org/jira/browse/SPARK-49834
>> > >>
>> > >> [ ] +1: Accept the proposal
>> > >> [ ] +0
>> > >> [ ] -1: I don’t think this is a good idea because …
>> > >>
>> > >> Thanks!
>> > >>
>> > >> Vladimir
>> > >>
>> > >
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org