This doesn't really have anything to do with a broader approach to breaking changes. Removing the mistake in 4.0.0 does not change the fact that we strive to avoid breaking APIs or silently changing behavior -- striving is not a guarantee. And the addition of check-in tooling should prevent the issue from recurring. This is very much about the one-time exception, not the rule.
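For readers skimming the archive: the "check-in tooling" referred to above is a Scalastyle rule in the Spark repo (SPARK-51173, mentioned downthread). As a rough illustration only -- the function name and rule details here are invented, not the actual rule -- the shape of such a check fits in a few lines:

```python
import re

# Hypothetical sketch of a config-name check like the Scalastyle rule
# discussed in this thread; the rule logic here is illustrative only.
VENDOR_PREFIX = re.compile(r"^spark\.databricks\.")

def config_name_ok(name: str) -> bool:
    # Reject vendor-prefixed keys; require the standard "spark." namespace.
    return name.startswith("spark.") and not VENDOR_PREFIX.match(name)
```

Under this sketch, the key discussed in this thread fails the check while its renamed form passes.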
On Tue, Feb 18, 2025 at 9:30 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>
> Hi Dongjoon,
>
> If this is a policy issue that necessitates a breaking change, then sure, let's proceed. I don't have a strong opinion on this specific case, but I'm more concerned with the broader approach to breaking changes.
>
> I'm referencing this statement from the Spark Version Policy: "The Spark project strives to avoid breaking APIs or silently changing behavior, even in major versions." Defining what constitutes a necessary breaking change can be tricky, and discussions like this on the dev list are valuable for building consensus. Thanks to Dongjoon for starting this discussion, and I encourage all of us to do it for other breaking changes as well.
>
> Thanks,
> Wenchen
>
> On Wed, Feb 19, 2025 at 12:56 PM Dongjoon Hyun <dongj...@apache.org> wrote:
>>
>> I have different perspectives from Wenchen's opinion in three ways.
>>
>> > I'd like to emphasize that a major version release is not a justification for unnecessary breaking changes.
>> > ...the period between 3.5.5 and 4.0.0 likely isn't long enough.
>>
>> First, it is an unavoidable, necessary change to protect the Apache Spark repository from a third-party company's configuration, because the Apache Spark repository is not owned by that company. We can fix this mistake in Spark 4.
>>
>> Second, Apache Spark follows Semantic Versioning. By definition, we can fix this only in Apache Spark 4.0.0; otherwise we would have to wait for Spark 5, perhaps six years from now.
>>
>> Third, Apache Spark 3.5.x has an extended support period until April 12th, 2026. That is why we should rename the configuration while keeping `spark.databricks.*` as an alternative name for the next 14 months. Apache Spark 4.0.0 is a major-version release, which is different from maintenance versions like 3.5.5, and Spark 4.0 does not force customers to migrate immediately.
>> This means customers have their own timeframe, independent of the Apache Spark release cadence.
>>
>> Additionally, regarding the following (although I'm not sure who *you* refers to in this context), Wenchen's proposal sounds clearly different from what this thread proposed. In other words, I proposed that we should not keep the legacy support in 4.0 and should not remove the config in 3.5.5.
>>
>> > Personally I think we should keep the legacy support in 4.0 as it's just a few lines of code, but if no one uses this conf and you have a strong opinion on this cleanup, let's remove the wrongly named conf in both 3.5.5 and 4.0.0.
>>
>> To be clear, I'd like to emphasize the following:
>> - It's not just about a few lines of code. It's more of a policy-side issue that we are supposed to uphold in the Apache Spark repository, both so far and in the future.
>> - Absent evidence to the contrary, every exposed configuration should be assumed to be in use by someone already.
>>
>> Sincerely,
>> Dongjoon.
>>
>> On 2025/02/19 03:36:13 Wenchen Fan wrote:
>> > Hi all,
>> >
>> > I'd like to emphasize that a major version release is not a justification for unnecessary breaking changes. If we are confident that no one is using this configuration, we should clean it up in 3.5.5 as well. However, if there's a possibility that users are already relying on it, then legacy support should be in 4.0 as well. We can remove the legacy support after a reasonable period; the exact timeline is not yet defined, but the period between 3.5.5 and 4.0.0 likely isn't long enough.
>> >
>> > Personally I think we should keep the legacy support in 4.0 as it's just a few lines of code, but if no one uses this conf and you have a strong opinion on this cleanup, let's remove the wrongly named conf in both 3.5.5 and 4.0.0.
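To make the "alternative name" / "legacy support" idea concrete: keeping the old key as an alternative usually amounts to a read-time fallback with a deprecation warning. The following is a minimal, hypothetical sketch of that pattern, not Spark's actual ConfigBuilder mechanism:

```python
import warnings

# Key names are taken from this thread; the resolution logic is illustrative.
NEW_KEY = "spark.sql.optimizer.pruneFiltersCanPruneStreamingSubplan"
OLD_KEY = "spark.databricks.sql.optimizer.pruneFiltersCanPruneStreamingSubplan"

def resolve(conf: dict, default: str = "true") -> str:
    # Prefer the new key; fall back to the deprecated alternative with a warning.
    if NEW_KEY in conf:
        return conf[NEW_KEY]
    if OLD_KEY in conf:
        warnings.warn(f"'{OLD_KEY}' is deprecated; use '{NEW_KEY}'", DeprecationWarning)
        return conf[OLD_KEY]
    return default
```

With this shape, removing legacy support in a later release means deleting only the fallback branch, which is why it is described in this thread as "just a few lines of code".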
>> >
>> > On Wed, Feb 19, 2025 at 9:03 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>> >
>> > > Depends how you want to play this. As usual, a cost/benefit analysis will be useful.
>> > >
>> > > *Immediate Removal in Spark 3.5.5*:
>> > > Pros: Quickly removes the problematic configuration, reducing technical debt and potential issues.
>> > > Cons: Users upgrading directly from earlier versions to Spark 3.5.5 or later will face immediate breakage without a deprecation period.
>> > >
>> > > *Deprecation in Spark 3.5.5 and Removal in Spark 4.0.0:*
>> > > Pros: Provides a clear deprecation period, allowing users to migrate their configurations. Reduces the risk of breakage for users who follow recommended upgrade paths.
>> > > Cons: Users who skip versions might still face issues. The deprecation period might be seen as too short if Spark 4.0.0 is released soon after Spark 3.5.5, which is most probable.
>> > >
>> > > *Extended Deprecation Period:*
>> > > Pros: Provides a longer migration window, reducing the risk of breakage for users who jump versions.
>> > > Cons: Delays the removal of the configuration, potentially prolonging technical challenges and issues related to the deprecated configuration.
>> > >
>> > > *My take*
>> > >
>> > > 1. Deprecate in Spark 3.5.5: Introduce deprecation warnings in Spark 3.5.5 stating that the spark.databricks.* configuration will be removed in Spark 4.0.0.
>> > > 2. Remove in Spark 4.0.0: Remove the configuration in Spark 4.0.0, providing a clear upgrade path and sufficient notice for users to migrate.
>> > > 3. User Communication: Ensure that deprecation warnings are prominent in the documentation, release notes, and runtime logs. Recommend upgrading through intermediate versions to avoid breakage.
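As a concrete illustration of what a "clear upgrade path" means for persisted state (for instance a streaming checkpoint that recorded the old key, the scenario discussed elsewhere in this thread), the migration boils down to rewriting the key when the persisted map is loaded. This is a hypothetical sketch for illustration, not the actual Spark code:

```python
# Key names are taken from this thread; the migration logic is illustrative.
NEW_KEY = "spark.sql.optimizer.pruneFiltersCanPruneStreamingSubplan"
OLD_KEY = "spark.databricks.sql.optimizer.pruneFiltersCanPruneStreamingSubplan"

def migrate_conf(persisted: dict) -> dict:
    # Carry the old key's value over to the new key and drop the old key,
    # so the next write of the map contains only the new name.
    migrated = dict(persisted)
    if OLD_KEY in migrated:
        value = migrated.pop(OLD_KEY)
        migrated.setdefault(NEW_KEY, value)
    return migrated
```

Because the rewrite happens transparently on load, users who never set the config manually are unaffected by either removal schedule.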
>> > >
>> > > HTH
>> > >
>> > > Dr Mich Talebzadeh,
>> > > Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>> > >
>> > > view my LinkedIn profile
>> > > <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>> > >
>> > > On Wed, 19 Feb 2025 at 00:41, Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>> > >
>> > >> Though if we are OK with requiring users to read the migration guide to figure out the change when upgrading directly to Spark 4.0.0+, I agree this is also one of the valid ways.
>> > >>
>> > >> On Wed, Feb 19, 2025 at 9:20 AM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>> > >>
>> > >>> The point is, why can't we remove it from Spark 3.5.5 as well if we are planning to "remove" (not deprecate) it at the very next minor release?
>> > >>>
>> > >>> The migration logic just works without the incorrect config key being exposed as a SQL config key.
>> > >>>
>> > >>> That said, the point we are debating here is only valid if we want to let users keep setting the value of the config manually for some time. I'd argue users would never set this manually, but let's put that aside for now.
>> > >>>
>> > >>> What is the proper "some time" here? If we deprecate this in Spark 3.5.5 and remove it in Spark 4.0.0, we will break the case of setting the config manually for users upgrading directly from Spark 3.5.4 to Spark 4.0.0. I have no idea how we would construct the message recommending an upgrade from Spark 3.5.4 to Spark 3.5.5, but if it isn't strong enough, there will always be users who jump directly across major/minor versions, and for them we effectively do not deprecate the config at all.
>> > >>>
>> > >>> I suspect there wouldn't be a huge difference if we just removed it in Spark 3.5.5 and only supported the migration case. That has been my claim, if we really want to kick the incorrect config out ASAP. If we want to do this gracefully, I don't feel that removing it in Spark 4.0.0 gives users enough time to learn that the config is deprecated and will be removed very soon.
>> > >>>
>> > >>> On Wed, Feb 19, 2025 at 2:24 AM Holden Karau <holden.ka...@gmail.com> wrote:
>> > >>>
>> > >>>> I think that removing in 4 sounds reasonable to me as well. It's important to create a sense of fairness among vendors.
>> > >>>>
>> > >>>> Twitter: https://twitter.com/holdenkarau
>> > >>>> Fight Health Insurance: https://www.fighthealthinsurance.com/
>> > >>>> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
>> > >>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>> > >>>> Pronouns: she/her
>> > >>>>
>> > >>>> On Tue, Feb 18, 2025 at 11:22 AM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>> > >>>>
>> > >>>>> I don't think there is a reason to keep it in 4.0.0 (and forever?) if we release Spark 3.5.5 with the proper deprecation. This is a big difference, Wenchen.
>> > >>>>>
>> > >>>>> And the difference is the main reason why I initiated this thread: to suggest removing `spark.databricks.*` completely from Apache Spark 4 by volunteering as the Spark 3.5.5 release manager.
>> > >>>>>
>> > >>>>> Sincerely,
>> > >>>>> Dongjoon
>> > >>>>>
>> > >>>>> On Mon, Feb 17, 2025 at 22:59 Wenchen Fan <cloud0...@gmail.com> wrote:
>> > >>>>>
>> > >>>>>> It's unfortunate that we missed identifying these issues during code review. However, since they have already been released, I believe deprecating them is a better approach than removing them, as the latter would introduce a breaking change.
>> > >>>>>>
>> > >>>>>> Regarding Jungtaek's PR <https://github.com/apache/spark/pull/49983>, it looks like there are only a few lines of migration code. Would it be acceptable to leave them in for legacy support? With the new config name style check rule in place, such issues should not occur again in the future.
>> > >>>>>>
>> > >>>>>> On Tue, Feb 18, 2025 at 9:00 AM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>> > >>>>>>
>> > >>>>>>> I think I can add some color to minimize the concern.
>> > >>>>>>>
>> > >>>>>>> The problematic config we added is arguably not user-facing. I'd argue most users wouldn't even understand what the flag is doing. The config was added because Structured Streaming has been leveraging SQL config to "do the magic" of having two different default values for new queries vs old queries (where the checkpoint was created by a version without the fix). It is used purely for backward compatibility, not something we want to give users flexibility over.
>> > >>>>>>>
>> > >>>>>>> That said, I don't see a risk in removing the config "at any point". (I'd even say removing this config in Spark 3.5.5 does not change anything.
>> > >>>>>>> The reason I'm not removing the config in 3.5 (and have yet to in 4.0/master) is just to address any concern about being conservative.)
>> > >>>>>>>
>> > >>>>>>> I think you are worrying about case 1 from my comment. In my new change (link <https://github.com/apache/spark/pull/49983>), I added migration logic for when the offset log contains the problematic configuration: we take the value but put it under the new config, and at the next microbatch planning the offset log will contain the new configuration going forward. This addresses case 1, as long as we retain the migration logic for a couple of minor releases (say, until 4.2 or so). We just need to keep this migration logic for as long as users might plausibly jump directly from Spark 3.5.4 to that version.
>> > >>>>>>>
>> > >>>>>>> Hope this helps to address your concern.
>> > >>>>>>>
>> > >>>>>>> On Tue, Feb 18, 2025 at 7:40 AM Bjørn Jørgensen <bjornjorgen...@gmail.com> wrote:
>> > >>>>>>>>
>> > >>>>>>>> Having breaking changes in a minor release seems not that good. As I'm reading this,
>> > >>>>>>>>
>> > >>>>>>>> "*This could break the query if the rule impacts the query, because the effectiveness of the fix is flipped.*"
>> > >>>>>>>> https://github.com/apache/spark/pull/49897#issuecomment-2652567140
>> > >>>>>>>>
>> > >>>>>>>> What if we keep the https://github.com/apache/spark/pull/48149 change in the branch and remove it only in version 4? That way we don't break anything.
>> > >>>>>>>>
>> > >>>>>>>> man. 17. feb. 2025 kl. 23:03 skrev Dongjoon Hyun <dongjoon.h...@gmail.com>:
>> > >>>>>>>>
>> > >>>>>>>>> Hi, All.
>> > >>>>>>>>>
>> > >>>>>>>>> I'd like to highlight this discussion because it is more important and tricky in a way.
>> > >>>>>>>>>
>> > >>>>>>>>> As already mentioned in the mailing list and PRs, there was an obvious mistake: an improper configuration name, `spark.databricks.*`, was missed.
>> > >>>>>>>>>
>> > >>>>>>>>> https://github.com/apache/spark/blob/a6f220d951742f4074b37772485ee0ec7a774e7d/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L3424
>> > >>>>>>>>>
>> > >>>>>>>>> `spark.databricks.sql.optimizer.pruneFiltersCanPruneStreamingSubplan`
>> > >>>>>>>>>
>> > >>>>>>>>> In fact, Apache Spark committers had been successfully preventing this repetitive mistake pattern during review, until the following backports slipped into Apache Spark 3.5.4:
>> > >>>>>>>>>
>> > >>>>>>>>> https://github.com/apache/spark/pull/45649
>> > >>>>>>>>> https://github.com/apache/spark/pull/48149
>> > >>>>>>>>> https://github.com/apache/spark/pull/49121
>> > >>>>>>>>>
>> > >>>>>>>>> At the time of writing, `spark.databricks.*` has been removed successfully from `master` and `branch-4.0`, and a new Scalastyle rule was added to protect the Apache Spark repository from future mistakes.
>> > >>>>>>>>>
>> > >>>>>>>>> SPARK-51172 Rename to spark.sql.optimizer.pruneFiltersCanPruneStreamingSubplan
>> > >>>>>>>>> SPARK-51173 Add `configName` Scalastyle rule
>> > >>>>>>>>>
>> > >>>>>>>>> What I propose is to release Apache Spark 3.5.5 next week with the deprecation, in order to make Apache Spark 4.0 free of the `spark.databricks.*` configuration.
>> > >>>>>>>>>
>> > >>>>>>>>> Apache Spark 3.5.5 (2025 February, with deprecation warning and alternative name)
>> > >>>>>>>>> Apache Spark 4.0.0 (2025 March, without the `spark.databricks.*` config)
>> > >>>>>>>>>
>> > >>>>>>>>> In addition, I'd like to volunteer as the release manager of Apache Spark 3.5.5 for a swift release. WDYT?
>> > >>>>>>>>>
>> > >>>>>>>>> FYI, `branch-3.5` currently has 37 patches:
>> > >>>>>>>>>
>> > >>>>>>>>> $ git log --oneline v3.5.4..HEAD | wc -l
>> > >>>>>>>>> 37
>> > >>>>>>>>>
>> > >>>>>>>>> Best Regards,
>> > >>>>>>>>> Dongjoon.
>> > >>>>>>>>
>> > >>>>>>>> --
>> > >>>>>>>> Bjørn Jørgensen
>> > >>>>>>>> Vestre Aspehaug 4, 6010 Ålesund
>> > >>>>>>>> Norge
>> > >>>>>>>> +47 480 94 297

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org