Hi all, I’d like to emphasize that a major version release is not a justification for unnecessary breaking changes. If we are confident that no one is using this configuration, we should clean it up in 3.5.5 as well. However, if there’s a possibility that users are already relying on it, then the legacy support should stay in 4.0 as well. We can remove the legacy support after a reasonable period; the exact timeline is not yet defined, but the window between 3.5.5 and 4.0.0 likely isn’t long enough.
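For concreteness, the legacy support in question (Jungtaek's PR, linked later in the thread) boils down to an offset-log key rename. A minimal sketch of the idea — the object and method names here are illustrative, not the actual PR code:

    // Sketch only: when a restored streaming query's offset log still records
    // the wrongly named key, carry its value over to the correctly named key.
    // The next microbatch persists only the new key, so the legacy name ages
    // out of checkpoints on its own.
    object OffsetLogConfMigration {
      val LegacyKey =
        "spark.databricks.sql.optimizer.pruneFiltersCanPruneStreamingSubplan"
      val CorrectKey =
        "spark.sql.optimizer.pruneFiltersCanPruneStreamingSubplan"

      def migrate(confs: Map[String, String]): Map[String, String] =
        confs.get(LegacyKey) match {
          case Some(v) => confs - LegacyKey + (CorrectKey -> v)
          case None    => confs
        }
    }

Because the rename happens at checkpoint restore time, even a direct 3.5.4 -> 4.0.0 upgrade would be handled, provided the logic is retained long enough.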
Personally, I think we should keep the legacy support in 4.0, as it's just a few lines of code. But if no one uses this conf and you have a strong opinion on this cleanup, let's remove the wrongly named conf in both 3.5.5 and 4.0.0.

On Wed, Feb 19, 2025 at 9:03 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Depends how you want to play this. As usual, a cost/benefit analysis will be useful.
>
> *Immediate Removal in Spark 3.5.5:*
> pros: Quickly removes the problematic configuration, reducing technical debt and potential issues.
> cons: Users upgrading directly from earlier versions to Spark 3.5.5 or later will face immediate breakage without a deprecation period.
>
> *Deprecation in Spark 3.5.5 and Removal in Spark 4.0.0:*
> pros: Provides a clear deprecation period, allowing users to migrate their configurations. Reduces the risk of breakage for users who follow recommended upgrade paths.
> cons: Users who skip versions might still face issues. The deprecation period might be seen as too short if Spark 4.0.0 is released soon after Spark 3.5.5, which it most probably will be.
>
> *Extended Deprecation Period:*
> pros: Provides a longer migration window, reducing the risk of breakage for users who jump versions.
> cons: Delays the removal of the configuration, potentially prolonging technical challenges and issues related to the deprecated configuration.
>
> *My take*
>
> 1. Deprecate in Spark 3.5.5: Introduce deprecation warnings in Spark 3.5.5 stating that the spark.databricks.* configuration will be removed in Spark 4.0.0.
> 2. Remove in Spark 4.0.0: Remove the configuration in Spark 4.0.0, providing a clear upgrade path and sufficient notice for users to migrate.
> 3. User Communication: Ensure that deprecation warnings are prominent in the documentation, release notes, and runtime logs. Recommend upgrading through intermediate versions to avoid breakage.
>
> HTH
>
> Dr Mich Talebzadeh,
> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>
> view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
> On Wed, 19 Feb 2025 at 00:41, Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>
>> Though, if we are OK with making users read the migration guide to figure out the change in the case of a direct upgrade to Spark 4.0.0+, I agree this is also one of the valid ways.
>>
>> On Wed, Feb 19, 2025 at 9:20 AM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>>
>>> The point is, why can't we remove it from Spark 3.5.5 as well, if we are planning to "remove" (not deprecate) it at the very next minor release?
>>>
>>> The migration logic just works without the incorrect key ever having to surface as a SQL config.
>>>
>>> That said, the point we are debating here is only valid if we want to let users keep setting the value of the config manually for some time. I'd argue users would never set this manually, but let's put that aside for now.
>>>
>>> What is the proper "some time" here? If we deprecate this in Spark 3.5.5 and remove it in Spark 4.0.0, we are going to break the case of setting the config manually for anyone upgrading from Spark 3.5.4 to Spark 4.0.0 directly. I have no idea how we would construct the message to recommend upgrading from Spark 3.5.4 to Spark 3.5.5, but if it's not strong enough, there will always be users who jump directly across major/minor versions, and for them we effectively do not deprecate the config at all.
>>> I doubt there would be a huge difference if we just removed it in Spark 3.5.5 and only supported the migration case. That has been my claim, if we really want to kick the incorrect config out ASAP. If we want to do this gracefully, I don't feel that removing it in Spark 4.0.0 gives users enough notice that the config is deprecated and will be removed very soon.
>>>
>>> On Wed, Feb 19, 2025 at 2:24 AM Holden Karau <holden.ka...@gmail.com> wrote:
>>>
>>>> I think that removing in 4 sounds reasonable to me as well. It's important to create a sense of fairness among vendors.
>>>>
>>>> Twitter: https://twitter.com/holdenkarau
>>>> Fight Health Insurance: https://www.fighthealthinsurance.com/
>>>> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>> Pronouns: she/her
>>>>
>>>> On Tue, Feb 18, 2025 at 11:22 AM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>>
>>>>> I don't think there is a reason to keep it in 4.0.0 (and forever?) if we release Spark 3.5.5 with the proper deprecation. This is a big difference, Wenchen.
>>>>>
>>>>> And that difference is the main reason why I initiated this thread: to suggest removing 'spark.databricks.*' completely from Apache Spark 4, by volunteering as the Spark 3.5.5 release manager.
>>>>>
>>>>> Sincerely,
>>>>> Dongjoon
>>>>>
>>>>> On Mon, Feb 17, 2025 at 22:59 Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>>
>>>>>> It's unfortunate that we missed identifying these issues during code review. However, since they have already been released, I believe deprecating them is a better approach than removing them, as the latter would introduce a breaking change.
>>>>>>
>>>>>> Regarding Jungtaek's PR <https://github.com/apache/spark/pull/49983>, it looks like there are only a few lines of migration code. Would it be acceptable to leave them in for legacy support? With the new config name style check rule in place, such issues should not occur again in the future.
>>>>>>
>>>>>> On Tue, Feb 18, 2025 at 9:00 AM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>>>>>>
>>>>>>> I think I can add some color to minimize the concern.
>>>>>>>
>>>>>>> The problematic config we added is arguably not user-facing. I'd argue most users wouldn't even understand what the flag does. The config was added because Structured Streaming has been leveraging SQL configs to "do the magic" of having two different default values for new queries vs. old queries (those whose checkpoint was created by a version where the fix had not landed). This is purely for backward compatibility, not something we want to give users flexibility over.
>>>>>>>
>>>>>>> That said, I don't see a risk in removing the config "at any point". (I'd even say removing this config in Spark 3.5.5 does not change anything. The reason I'm not removing the config in 3.5, and not yet in 4.0/master, is just to address any concern about being conservative.)
>>>>>>>
>>>>>>> I think you are worrying about case 1 from my comment.
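>>>>>>> (To make the "magic" above concrete, a rough sketch — the names are illustrative, not the actual implementation: a query restarted from an existing checkpoint keeps the value its offset log recorded, while only a brand-new query picks up the current default.)
>>>>>>>
>>>>>>>     object StreamingConfCompat {
>>>>>>>       val ConfKey = "spark.sql.optimizer.pruneFiltersCanPruneStreamingSubplan"
>>>>>>>
>>>>>>>       // A query restarted from a checkpoint keeps whatever value its
>>>>>>>       // offset log recorded; only a brand-new query (no recorded value)
>>>>>>>       // falls back to the current default.
>>>>>>>       def effectiveValue(offsetLogConfs: Map[String, String],
>>>>>>>                          defaultForNewQueries: Boolean): Boolean =
>>>>>>>         offsetLogConfs.get(ConfKey).map(_.toBoolean).getOrElse(defaultForNewQueries)
>>>>>>>     }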
>>>>>>> In my new change (link <https://github.com/apache/spark/pull/49983>), I added migration logic for the case where the offset log contains the problematic configuration: we take its value but store it under the new config, so at the next microbatch planning the offset log will contain the new configuration going forward. This addresses case 1, as long as we retain the migration logic for a couple of minor releases (say, until 4.2 or so). We just need to keep the migration logic for as long as someone might plausibly jump directly from Spark 3.5.4 to that version.
>>>>>>>
>>>>>>> Hope this helps address your concerns.
>>>>>>>
>>>>>>> On Tue, Feb 18, 2025 at 7:40 AM Bjørn Jørgensen <bjornjorgen...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Having breaking changes in a minor release doesn't seem that good. As I read this,
>>>>>>>>
>>>>>>>> "*This could break the query if the rule impacts the query, because the effectiveness of the fix is flipped.*"
>>>>>>>> https://github.com/apache/spark/pull/49897#issuecomment-2652567140
>>>>>>>>
>>>>>>>> What if we keep the https://github.com/apache/spark/pull/48149 change in the branch and remove it only in version 4? That way we don't break anything.
>>>>>>>>
>>>>>>>> On Mon, 17 Feb 2025 at 23:03, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi, All.
>>>>>>>>>
>>>>>>>>> I'd like to highlight this discussion because it is more important, and trickier, in a way.
>>>>>>>>>
>>>>>>>>> As already mentioned in the mailing list and PRs, there was an obvious mistake that let an improperly named configuration, `spark.databricks.*`, slip through:
>>>>>>>>>
>>>>>>>>> https://github.com/apache/spark/blob/a6f220d951742f4074b37772485ee0ec7a774e7d/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L3424
>>>>>>>>>
>>>>>>>>> `spark.databricks.sql.optimizer.pruneFiltersCanPruneStreamingSubplan`
>>>>>>>>>
>>>>>>>>> In fact, Apache Spark committers had been successfully preventing this repetitive mistake pattern during review, until the following backports slipped into Apache Spark 3.5.4:
>>>>>>>>>
>>>>>>>>> https://github.com/apache/spark/pull/45649
>>>>>>>>> https://github.com/apache/spark/pull/48149
>>>>>>>>> https://github.com/apache/spark/pull/49121
>>>>>>>>>
>>>>>>>>> At the time of writing, `spark.databricks.*` has been removed successfully from `master` and `branch-4.0`, and a new Scalastyle rule has been added to protect the Apache Spark repository from future mistakes:
>>>>>>>>>
>>>>>>>>> SPARK-51172 Rename to spark.sql.optimizer.pruneFiltersCanPruneStreamingSubplan
>>>>>>>>> SPARK-51173 Add `configName` Scalastyle rule
>>>>>>>>>
>>>>>>>>> What I propose is to release Apache Spark 3.5.5 next week with the deprecation, in order to make Apache Spark 4.0 free of the `spark.databricks.*` configuration:
>>>>>>>>>
>>>>>>>>> Apache Spark 3.5.5 (February 2025, with a deprecation warning and an alternative)
>>>>>>>>> Apache Spark 4.0.0 (March 2025, without the `spark.databricks.*` config)
>>>>>>>>>
>>>>>>>>> In addition, I'd like to volunteer as the release manager of Apache Spark 3.5.5 for a swift release. WDYT?
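>>>>>>>>> (For reference, the `configName` rule is in the spirit of the following sketch — illustrative only, not the actual Scalastyle rule; see SPARK-51173 for the real thing:)
>>>>>>>>>
>>>>>>>>>     object ConfigNameCheck {
>>>>>>>>>       // Flag any buildConf key that is outside the "spark." namespace
>>>>>>>>>       // or that carries a vendor prefix such as "spark.databricks.".
>>>>>>>>>       private val ConfKey = """buildConf\("([^"]+)"\)""".r
>>>>>>>>>
>>>>>>>>>       // Returns the offending key, if any, found on a source line.
>>>>>>>>>       def violation(line: String): Option[String] =
>>>>>>>>>         ConfKey.findFirstMatchIn(line).map(_.group(1)).filterNot { key =>
>>>>>>>>>           key.startsWith("spark.") && !key.startsWith("spark.databricks.")
>>>>>>>>>         }
>>>>>>>>>     }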
>>>>>>>>> FYI, `branch-3.5` has 37 patches currently.
>>>>>>>>>
>>>>>>>>>     $ git log --oneline v3.5.4..HEAD | wc -l
>>>>>>>>>     37
>>>>>>>>>
>>>>>>>>> Best Regards,
>>>>>>>>> Dongjoon.
>>>>>>>>
>>>>>>>> --
>>>>>>>> Bjørn Jørgensen
>>>>>>>> Vestre Aspehaug 4, 6010 Ålesund, Norge
>>>>>>>> +47 480 94 297