This doesn't really have anything to do with a broader approach to breaking changes. Removing the mistake in 4.0.0 does not change the fact that we strive to avoid breaking APIs or silently changing behavior -- striving is not a guarantee. And the addition of check-in tooling should prevent the issue from recurring. This is very much about the one-time exception, not the rule.
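For readers skimming the archive: the "check-in tooling" referred to above is a Scalastyle rule in the Spark repo (SPARK-51173, mentioned downthread). As a rough illustration only -- the function name and rule details here are invented, not the actual rule -- the shape of such a check fits in a few lines:

```python
import re

# Hypothetical sketch of a config-name check like the Scalastyle rule
# discussed in this thread; the rule logic here is illustrative only.
VENDOR_PREFIX = re.compile(r"^spark\.databricks\.")

def config_name_ok(name: str) -> bool:
    # Reject vendor-prefixed keys; require the standard "spark." namespace.
    return name.startswith("spark.") and not VENDOR_PREFIX.match(name)
```

Under this sketch, the key discussed in this thread fails the check while its renamed form passes.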
On Tue, Feb 18, 2025 at 9:30 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>
> Hi Dongjoon,
>
> If this is a policy issue that necessitates a breaking change, then sure, let's proceed. I don't have a strong opinion on this specific case, but I'm more concerned with the broader approach to breaking changes.
>
> I'm referencing this statement from the Spark Version Policy: "The Spark project strives to avoid breaking APIs or silently changing behavior, even in major versions." Defining what constitutes a necessary breaking change can be tricky, and discussions like this on the dev list are valuable for building consensus. Thanks to Dongjoon for starting this discussion, and I encourage all of us to do it for other breaking changes as well.
>
> Thanks,
> Wenchen
>
> On Wed, Feb 19, 2025 at 12:56 PM Dongjoon Hyun <dongj...@apache.org> wrote:
>>
>> I have different perspectives from Wenchen's opinion in three ways.
>>
>> > I'd like to emphasize that a major version release is not a justification for unnecessary breaking changes.
>> > ...the period between 3.5.5 and 4.0.0 likely isn't long enough.
>>
>> First, it is an unavoidable, necessary change to protect the Apache Spark repository from a third-party company's configuration, because the Apache Spark repository is not owned by that company. We can fix this mistake in Spark 4.
>>
>> Second, Apache Spark follows Semantic Versioning. By definition, we can fix this only in Apache Spark 4.0.0; otherwise we would have to wait for Spark 5, perhaps six years from now.
>>
>> Third, Apache Spark 3.5.x has an extended support period until April 12th, 2026. That is why we should rename the configuration while keeping `spark.databricks.*` as an alternative name for the next 14 months. Apache Spark 4.0.0 is a major-version release, which is different from maintenance versions like 3.5.5, and Spark 4.0 does not force customers to migrate immediately.
>> This means customers have their own timeframe, independent of the Apache Spark release cadence.
>>
>> Additionally, regarding the following (although I'm not sure who *you* refers to in this context), Wenchen's proposal sounds clearly different from what this thread proposed. In other words, I proposed that we should not keep the legacy support in 4.0 and should not remove the config in 3.5.5.
>>
>> > Personally I think we should keep the legacy support in 4.0 as it's just a few lines of code, but if no one uses this conf and you have a strong opinion on this cleanup, let's remove the wrongly named conf in both 3.5.5 and 4.0.0.
>>
>> To be clear, I'd like to emphasize the following:
>> - It's not just about a few lines of code. It's more of a policy-side issue that we are supposed to uphold in the Apache Spark repository, both so far and in the future.
>> - Absent evidence to the contrary, every exposed configuration should be assumed to be in use by someone already.
>>
>> Sincerely,
>> Dongjoon.
>>
>> On 2025/02/19 03:36:13 Wenchen Fan wrote:
>> > Hi all,
>> >
>> > I'd like to emphasize that a major version release is not a justification for unnecessary breaking changes. If we are confident that no one is using this configuration, we should clean it up in 3.5.5 as well. However, if there's a possibility that users are already relying on it, then legacy support should be in 4.0 as well. We can remove the legacy support after a reasonable period; the exact timeline is not yet defined, but the period between 3.5.5 and 4.0.0 likely isn't long enough.
>> >
>> > Personally I think we should keep the legacy support in 4.0 as it's just a few lines of code, but if no one uses this conf and you have a strong opinion on this cleanup, let's remove the wrongly named conf in both 3.5.5 and 4.0.0.
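To make the "alternative name" / "legacy support" idea concrete: keeping the old key as an alternative usually amounts to a read-time fallback with a deprecation warning. The following is a minimal, hypothetical sketch of that pattern, not Spark's actual ConfigBuilder mechanism:

```python
import warnings

# Key names are taken from this thread; the resolution logic is illustrative.
NEW_KEY = "spark.sql.optimizer.pruneFiltersCanPruneStreamingSubplan"
OLD_KEY = "spark.databricks.sql.optimizer.pruneFiltersCanPruneStreamingSubplan"

def resolve(conf: dict, default: str = "true") -> str:
    # Prefer the new key; fall back to the deprecated alternative with a warning.
    if NEW_KEY in conf:
        return conf[NEW_KEY]
    if OLD_KEY in conf:
        warnings.warn(f"'{OLD_KEY}' is deprecated; use '{NEW_KEY}'", DeprecationWarning)
        return conf[OLD_KEY]
    return default
```

With this shape, removing legacy support in a later release means deleting only the fallback branch, which is why it is described in this thread as "just a few lines of code".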
>> >
>> > On Wed, Feb 19, 2025 at 9:03 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>> >
>> > > Depends how you want to play this. As usual, a cost/benefit analysis will be useful.
>> > >
>> > > *Immediate Removal in Spark 3.5.5*:
>> > > Pros: Quickly removes the problematic configuration, reducing technical debt and potential issues.
>> > > Cons: Users upgrading directly from earlier versions to Spark 3.5.5 or later will face immediate breakage without a deprecation period.
>> > >
>> > > *Deprecation in Spark 3.5.5 and Removal in Spark 4.0.0:*
>> > > Pros: Provides a clear deprecation period, allowing users to migrate their configurations. Reduces the risk of breakage for users who follow recommended upgrade paths.
>> > > Cons: Users who skip versions might still face issues. The deprecation period might be seen as too short if Spark 4.0.0 is released soon after Spark 3.5.5, which is most probable.
>> > >
>> > > *Extended Deprecation Period:*
>> > > Pros: Provides a longer migration window, reducing the risk of breakage for users who jump versions.
>> > > Cons: Delays the removal of the configuration, potentially prolonging technical challenges and issues related to the deprecated configuration.
>> > >
>> > > *My take*
>> > >
>> > > 1. Deprecate in Spark 3.5.5: Introduce deprecation warnings in Spark 3.5.5 stating that the spark.databricks.* configuration will be removed in Spark 4.0.0.
>> > > 2. Remove in Spark 4.0.0: Remove the configuration in Spark 4.0.0, providing a clear upgrade path and sufficient notice for users to migrate.
>> > > 3. User Communication: Ensure that deprecation warnings are prominent in the documentation, release notes, and runtime logs. Recommend upgrading through intermediate versions to avoid breakage.
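As a concrete illustration of what a "clear upgrade path" means for persisted state (for instance a streaming checkpoint that recorded the old key, the scenario discussed elsewhere in this thread), the migration boils down to rewriting the key when the persisted map is loaded. This is a hypothetical sketch for illustration, not the actual Spark code:

```python
# Key names are taken from this thread; the migration logic is illustrative.
NEW_KEY = "spark.sql.optimizer.pruneFiltersCanPruneStreamingSubplan"
OLD_KEY = "spark.databricks.sql.optimizer.pruneFiltersCanPruneStreamingSubplan"

def migrate_conf(persisted: dict) -> dict:
    # Carry the old key's value over to the new key and drop the old key,
    # so the next write of the map contains only the new name.
    migrated = dict(persisted)
    if OLD_KEY in migrated:
        value = migrated.pop(OLD_KEY)
        migrated.setdefault(NEW_KEY, value)
    return migrated
```

Because the rewrite happens transparently on load, users who never set the config manually are unaffected by either removal schedule.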
>> > >
>> > > HTH
>> > >
>> > > Dr Mich Talebzadeh,
>> > > Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>> > >
>> > > view my LinkedIn profile
>> > > <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>> > >
>> > > On Wed, 19 Feb 2025 at 00:41, Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>> > >
>> > >> Though if we are OK with requiring users to read the migration guide to figure out the change when upgrading directly to Spark 4.0.0+, I agree this is also one of the valid ways.
>> > >>
>> > >> On Wed, Feb 19, 2025 at 9:20 AM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>> > >>
>> > >>> The point is, why can't we remove it from Spark 3.5.5 as well if we are planning to "remove" (not deprecate) it at the very next minor release?
>> > >>>
>> > >>> The migration logic just works without the incorrect config key being exposed as a SQL config key.
>> > >>>
>> > >>> That said, the point we are debating here is only valid if we want to let users keep setting the value of the config manually for some time. I'd argue users would never set this manually, but let's put that aside for now.
>> > >>>
>> > >>> What is the proper "some time" here? If we deprecate this in Spark 3.5.5 and remove it in Spark 4.0.0, we will break the case of setting the config manually for users upgrading directly from Spark 3.5.4 to Spark 4.0.0. I have no idea how we would construct the message recommending an upgrade from Spark 3.5.4 to Spark 3.5.5, but if it isn't strong enough, there will always be users who jump directly across major/minor versions, and for them we effectively do not deprecate the config at all.
>> > >>>
>> > >>> I suspect there wouldn't be a huge difference if we just removed it in Spark 3.5.5 and only supported the migration case. That has been my claim, if we really want to kick the incorrect config out ASAP. If we want to do this gracefully, I don't feel that removing it in Spark 4.0.0 gives users enough time to learn that the config is deprecated and will be removed very soon.
>> > >>>
>> > >>> On Wed, Feb 19, 2025 at 2:24 AM Holden Karau <holden.ka...@gmail.com> wrote:
>> > >>>
>> > >>>> I think that removing in 4 sounds reasonable to me as well. It's important to create a sense of fairness among vendors.
>> > >>>>
>> > >>>> Twitter: https://twitter.com/holdenkarau
>> > >>>> Fight Health Insurance: https://www.fighthealthinsurance.com/
>> > >>>> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
>> > >>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>> > >>>> Pronouns: she/her
>> > >>>>
>> > >>>> On Tue, Feb 18, 2025 at 11:22 AM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>> > >>>>
>> > >>>>> I don't think there is a reason to keep it in 4.0.0 (and forever?) if we release Spark 3.5.5 with the proper deprecation. This is a big difference, Wenchen.
>> > >>>>>
>> > >>>>> And the difference is the main reason why I initiated this thread: to suggest removing `spark.databricks.*` completely from Apache Spark 4 by volunteering as the Spark 3.5.5 release manager.
>> > >>>>>
>> > >>>>> Sincerely,
>> > >>>>> Dongjoon
>> > >>>>>
>> > >>>>> On Mon, Feb 17, 2025 at 22:59 Wenchen Fan <cloud0...@gmail.com> wrote:
>> > >>>>>
>> > >>>>>> It's unfortunate that we missed identifying these issues during code review. However, since they have already been released, I believe deprecating them is a better approach than removing them, as the latter would introduce a breaking change.
>> > >>>>>>
>> > >>>>>> Regarding Jungtaek's PR <https://github.com/apache/spark/pull/49983>, it looks like there are only a few lines of migration code. Would it be acceptable to leave them in for legacy support? With the new config name style check rule in place, such issues should not occur again in the future.
>> > >>>>>>
>> > >>>>>> On Tue, Feb 18, 2025 at 9:00 AM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>> > >>>>>>
>> > >>>>>>> I think I can add some color to minimize the concern.
>> > >>>>>>>
>> > >>>>>>> The problematic config we added is arguably not user-facing. I'd argue most users wouldn't even understand what the flag is doing. The config was added because Structured Streaming has been leveraging SQL config to "do the magic" of having two different default values for new queries vs old queries (where the checkpoint was created by a version without the fix). It is used purely for backward compatibility, not something we want to give users flexibility over.
>> > >>>>>>>
>> > >>>>>>> That said, I don't see a risk in removing the config "at any point". (I'd even say removing this config in Spark 3.5.5 does not change anything.
>> > >>>>>>> The reason I'm not removing the config in 3.5 (and have yet to in 4.0/master) is just to address any concern about being conservative.)
>> > >>>>>>>
>> > >>>>>>> I think you are worrying about case 1 from my comment. In my new change (link <https://github.com/apache/spark/pull/49983>), I added migration logic for when the offset log contains the problematic configuration: we take the value but put it under the new config, and at the next microbatch planning the offset log will contain the new configuration going forward. This addresses case 1, as long as we retain the migration logic for a couple of minor releases (say, until 4.2 or so). We just need to keep this migration logic for as long as users might plausibly jump directly from Spark 3.5.4 to that version.
>> > >>>>>>>
>> > >>>>>>> Hope this helps to address your concern.
>> > >>>>>>>
>> > >>>>>>> On Tue, Feb 18, 2025 at 7:40 AM Bjørn Jørgensen <bjornjorgen...@gmail.com> wrote:
>> > >>>>>>>>
>> > >>>>>>>> Having breaking changes in a minor release seems not that good. As I'm reading this,
>> > >>>>>>>>
>> > >>>>>>>> "*This could break the query if the rule impacts the query, because the effectiveness of the fix is flipped.*"
>> > >>>>>>>> https://github.com/apache/spark/pull/49897#issuecomment-2652567140
>> > >>>>>>>>
>> > >>>>>>>> What if we keep the https://github.com/apache/spark/pull/48149 change in the branch and remove it only in version 4? That way we don't break anything.
>> > >>>>>>>>
>> > >>>>>>>> man. 17. feb. 2025 kl. 23:03 skrev Dongjoon Hyun <dongjoon.h...@gmail.com>:
>> > >>>>>>>>
>> > >>>>>>>>> Hi, All.
>> > >>>>>>>>>
>> > >>>>>>>>> I'd like to highlight this discussion because it is more important and tricky in a way.
>> > >>>>>>>>>
>> > >>>>>>>>> As already mentioned in the mailing list and PRs, there was an obvious mistake: an improper configuration name, `spark.databricks.*`, was missed.
>> > >>>>>>>>>
>> > >>>>>>>>> https://github.com/apache/spark/blob/a6f220d951742f4074b37772485ee0ec7a774e7d/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L3424
>> > >>>>>>>>>
>> > >>>>>>>>> `spark.databricks.sql.optimizer.pruneFiltersCanPruneStreamingSubplan`
>> > >>>>>>>>>
>> > >>>>>>>>> In fact, Apache Spark committers had been successfully preventing this repetitive mistake pattern during review, until the following backports slipped into Apache Spark 3.5.4:
>> > >>>>>>>>>
>> > >>>>>>>>> https://github.com/apache/spark/pull/45649
>> > >>>>>>>>> https://github.com/apache/spark/pull/48149
>> > >>>>>>>>> https://github.com/apache/spark/pull/49121
>> > >>>>>>>>>
>> > >>>>>>>>> At the time of writing, `spark.databricks.*` has been removed successfully from `master` and `branch-4.0`, and a new Scalastyle rule was added to protect the Apache Spark repository from future mistakes.
>> > >>>>>>>>>
>> > >>>>>>>>> SPARK-51172 Rename to spark.sql.optimizer.pruneFiltersCanPruneStreamingSubplan
>> > >>>>>>>>> SPARK-51173 Add `configName` Scalastyle rule
>> > >>>>>>>>>
>> > >>>>>>>>> What I propose is to release Apache Spark 3.5.5 next week with the deprecation, in order to make Apache Spark 4.0 free of the `spark.databricks.*` configuration.
>> > >>>>>>>>>
>> > >>>>>>>>> Apache Spark 3.5.5 (2025 February, with deprecation warning and alternative name)
>> > >>>>>>>>> Apache Spark 4.0.0 (2025 March, without the `spark.databricks.*` config)
>> > >>>>>>>>>
>> > >>>>>>>>> In addition, I'd like to volunteer as the release manager of Apache Spark 3.5.5 for a swift release. WDYT?
>> > >>>>>>>>>
>> > >>>>>>>>> FYI, `branch-3.5` currently has 37 patches:
>> > >>>>>>>>>
>> > >>>>>>>>> $ git log --oneline v3.5.4..HEAD | wc -l
>> > >>>>>>>>> 37
>> > >>>>>>>>>
>> > >>>>>>>>> Best Regards,
>> > >>>>>>>>> Dongjoon.
>> > >>>>>>>>
>> > >>>>>>>> --
>> > >>>>>>>> Bjørn Jørgensen
>> > >>>>>>>> Vestre Aspehaug 4, 6010 Ålesund
>> > >>>>>>>> Norge
>> > >>>>>>>> +47 480 94 297

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org