It looks like the concern is as much about having the vendor's name in the
config as about security, though we would still want the removal to be
graceful within a single version line.

If that's the case, I'm OK with removing the config in 4.0.0 and taking the
migration-doc path, while deprecating the config in the Spark 3.5 version
line. I don't think this is the ideal way to deprecate and eventually
remove a config, but the majority of people in this thread seem to agree
that this is clearly not a normal case. I just want to make clear that we
should never remove a config this aggressively in normal cases, which I
assume we all agree on.
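For context, the "deprecate with an alternative name" path discussed downthread amounts to a lookup that prefers the new key but still honors the legacy key with a warning. The key names below are the real ones from this thread, but the resolver itself is only an illustrative sketch, not Spark's actual config machinery:

```scala
object ConfResolution {
  val NewKey = "spark.sql.optimizer.pruneFiltersCanPruneStreamingSubplan"
  val LegacyKey =
    "spark.databricks.sql.optimizer.pruneFiltersCanPruneStreamingSubplan"

  // Resolve a config value, preferring the new key; reading the legacy key
  // still succeeds but produces a deprecation warning.
  def resolve(settings: Map[String, String],
              default: String): (String, Seq[String]) =
    settings.get(NewKey) match {
      case Some(v) => (v, Nil)
      case None =>
        settings.get(LegacyKey) match {
          case Some(v) =>
            (v, Seq(s"Config '$LegacyKey' is deprecated; use '$NewKey'."))
          case None => (default, Nil)
        }
    }
}
```

Under this scheme, 3.5.5 would keep both lookups and log the warning, while 4.0.0 would simply drop the legacy branch.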


On Wed, Feb 19, 2025 at 3:20 PM Wenchen Fan <cloud0...@gmail.com> wrote:

> Hi Mark,
>
> If I understand correctly, we are introducing a breaking change in 4.0 by
> removing configs because it is necessary. I’m not suggesting that we are
> violating the rule, just ensuring that there is consensus on this being a
> necessary breaking change, which it seems there is. And yes, this is a
> one-time exception, we are not changing the rule here. I'm simply
> suggesting that we have more discussions about breaking changes on the dev
> list in the future, as exceptions may arise.
>
> I hope I’ve made myself clear this time.
>
> Thanks,
> Wenchen
>
> On Wed, Feb 19, 2025 at 1:59 PM Mark Hamstra <markhams...@gmail.com>
> wrote:
>
>> This doesn't really have anything to do with a broader approach to
>> breaking changes. Removing the mistake in 4.0.0 does not change our
>> striving to avoid breaking APIs or silently changing behavior --
>> striving is not a guarantee. And the addition of check-in tooling
>> should prevent the issue from recurring. This is very much about the
>> one-time exception, not the rule.
>>
>> On Tue, Feb 18, 2025 at 9:30 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>> >
>> > Hi Dongjoon,
>> >
>> > If this is a policy issue that necessitates a breaking change, then
>> sure, let’s proceed. I don’t have a strong opinion on this specific case,
>> but I’m more concerned with the broader approach to breaking changes.
>> >
>> > I’m referencing this statement from the Spark Version Policy: "The
>> Spark project strives to avoid breaking APIs or silently changing behavior,
>> even in major versions." Defining what constitutes a necessary breaking
>> change can be tricky, and discussions like this on the dev list are
>> valuable for building consensus. Thanks to Dongjoon for starting this
>> discussion, and I encourage all of us to do it for other breaking changes
>> as well.
>> >
>> > Thanks,
>> > Wenchen
>> >
>> > On Wed, Feb 19, 2025 at 12:56 PM Dongjoon Hyun <dongj...@apache.org>
>> wrote:
>> >>
>> >> I have different perspectives from Wenchen's opinion in three ways.
>> >>
>> >> > I’d like to emphasize that a major version release is not a
>> justification
>> >> > for unnecessary breaking changes.
>> >> > ...the period between 3.5.5 and 4.0.0 likely isn’t long enough.
>> >>
>> >> First, it is an inevitable and necessary change: we must protect the
>> >> Apache Spark repository from a third-party company's configuration
>> >> namespace, because the repository is not owned by that company. We can
>> >> fix this mistake in Spark 4.
>> >>
>> >> Second, Apache Spark follows Semantic Versioning. By definition, we can
>> >> fix this only in Apache Spark 4.0.0; otherwise we would wait perhaps six
>> >> years for Spark 5.
>> >>
>> >> Third, Apache Spark 3.5.x has an extended support period until April
>> >> 12th, 2026. That is why we should rename the configuration while keeping
>> >> `spark.databricks.*` as an alternative name for the next 14 months. The
>> >> Apache Spark 4.0.0 release is a major-version release, unlike
>> >> maintenance versions such as 3.5.5. Spark 4.0 does not force customers
>> >> to migrate immediately, which means customers have their own timeframe,
>> >> independent of the Apache Spark release cadence.
>> >>
>> >> Additionally, regarding the following (although I'm not sure who *you*
>> >> refers to in this context), Wenchen's proposal sounds clearly different
>> >> from what this thread proposed. In other words, I proposed that we
>> >> should not keep the legacy support in 4.0, and that we should not remove
>> >> the config in 3.5.5.
>> >>
>> >> > Personally I think we should keep the legacy support in 4.0 as it's
>> just a
>> >> > few lines of code, but if no one uses this conf and you have a strong
>> >> > opinion on this cleanup, let's remove the wrongly named conf in both
>> 3.5.5
>> >> > and 4.0.0.
>> >>
>> >> To be clear, I'd like to emphasize the following:
>> >> - It's not just about a few lines of code. It's a policy issue that we
>> >> are supposed to uphold in the Apache Spark repository, now and in the
>> >> future.
>> >> - Absent evidence to the contrary, every exposed configuration should be
>> >> assumed to be already in use by someone.
>> >>
>> >> Sincerely,
>> >> Dongjoon.
>> >>
>> >> On 2025/02/19 03:36:13 Wenchen Fan wrote:
>> >> > Hi all,
>> >> >
>> >> > I’d like to emphasize that a major version release is not a
>> >> > justification for unnecessary breaking changes. If we are confident
>> >> > that no one is using this configuration, we should clean it up in
>> >> > 3.5.5 as well. However, if there’s a possibility that users are
>> >> > already relying on it, then legacy support should remain in 4.0 as
>> >> > well. We can remove the legacy support after a reasonable period; the
>> >> > exact timeline is not yet defined, but the period between 3.5.5 and
>> >> > 4.0.0 likely isn’t long enough.
>> >> >
>> >> > Personally I think we should keep the legacy support in 4.0 as it's
>> just a
>> >> > few lines of code, but if no one uses this conf and you have a strong
>> >> > opinion on this cleanup, let's remove the wrongly named conf in both
>> 3.5.5
>> >> > and 4.0.0.
>> >> >
>> >> > On Wed, Feb 19, 2025 at 9:03 AM Mich Talebzadeh <
>> mich.talebza...@gmail.com>
>> >> > wrote:
>> >> >
>> >> > > It depends on how you want to play this. As usual, a cost/benefit
>> >> > > analysis will be useful.
>> >> > >
>> >> > > *Immediate Removal in Spark 3.5.5*:
>> >> > > pros: Quickly removes the problematic configuration, reducing
>> technical
>> >> > > debt and potential issues.
>> >> > > cons: Users upgrading directly from earlier versions to Spark
>> 3.5.5 or
>> >> > > later will face immediate breakage without a deprecation period.
>> >> > >
>> >> > > *Deprecation in Spark 3.5.5 and Removal in Spark 4.0.0:*
>> >> > > pros: Provides a clear deprecation period, allowing users to
>> migrate their
>> >> > > configurations. Reduces the risk of breakage for users who follow
>> >> > > recommended upgrade paths.
>> >> > > cons: Users who skip versions might still face issues. The
>> >> > > deprecation period might be seen as too short if Spark 4.0.0 is
>> >> > > released soon after Spark 3.5.5, which is quite likely.
>> >> > >
>> >> > > *Extended Deprecation Period:*
>> >> > > pros: Provides a longer migration window, reducing the risk of
>> breakage
>> >> > > for users who jump versions.
>> >> > > cons: Delays the removal of the configuration, potentially
>> prolonging
>> >> > > technical challenges and issues related to the deprecated
>> configuration.
>> >> > >
>> >> > > *My take*
>> >> > >
>> >> > >    1. Deprecate in Spark 3.5.5: Introduce deprecation warnings in
>> >> > >    Spark 3.5.5 stating that the spark.databricks.* configuration
>> >> > >    will be removed in Spark 4.0.0.
>> >> > >    2. Remove in Spark 4.0.0: Remove the configuration in Spark
>> 4.0.0,
>> >> > >    providing a clear upgrade path and sufficient notice for users
>> to migrate.
>> >> > >    3. User Communication: Ensure that deprecation warnings are
>> prominent
>> >> > >    in the documentation, release notes, and runtime logs.
>> Recommend upgrading
>> >> > >    through intermediate versions to avoid breakage.
>> >> > >
>> >> > > HTH
>> >> > >
>> >> > > Dr Mich Talebzadeh,
>> >> > > Architect | Data Science | Financial Crime | Forensic Analysis |
>> GDPR
>> >> > >
>> >> > >    view my Linkedin profile
>> >> > > <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>> >> > >
>> >> > >
>> >> > >
>> >> > >
>> >> > >
>> >> > > On Wed, 19 Feb 2025 at 00:41, Jungtaek Lim <
>> kabhwan.opensou...@gmail.com>
>> >> > > wrote:
>> >> > >
>> >> > >> Though if we are OK with requiring users to read the migration
>> >> > >> guide to figure out the change when upgrading directly to Spark
>> >> > >> 4.0.0+, I agree this is also a valid approach.
>> >> > >>
>> >> > >> On Wed, Feb 19, 2025 at 9:20 AM Jungtaek Lim <
>> >> > >> kabhwan.opensou...@gmail.com> wrote:
>> >> > >>
>> >> > >>> The point is: why can't we remove it in Spark 3.5.5 as well, if
>> >> > >>> we are planning to "remove" (not deprecate) it in the very next
>> >> > >>> minor release?
>> >> > >>>
>> >> > >>> The migration logic just works without the incorrect config key
>> >> > >>> ever being surfaced as a SQL config key.
>> >> > >>>
>> >> > >>> That said, the point we debate here is only valid when we want
>> to let
>> >> > >>> users keep setting the value of the config manually for some
>> time. I'd
>> >> > >>> argue users would never set this manually, but let's put this
>> aside for now.
>> >> > >>>
>> >> > >>> What is the proper "some time" here? If we deprecate this in
>> >> > >>> Spark 3.5.5 and remove it in Spark 4.0.0, we will break the case
>> >> > >>> of setting the config manually for users who upgrade directly
>> >> > >>> from Spark 3.5.4 to Spark 4.0.0. I have no idea how we would
>> >> > >>> construct the message recommending an upgrade from Spark 3.5.4 to
>> >> > >>> Spark 3.5.5, but if it's not strong enough, there will always be
>> >> > >>> users who jump directly across major/minor versions, and for them
>> >> > >>> we effectively do not deprecate the config very well.
>> >> > >>>
>> >> > >>> I doubt there would be a huge difference if we just removed it in
>> >> > >>> Spark 3.5.5 and only supported the migration case. That has been
>> >> > >>> my claim for the scenario where we really want to kick the
>> >> > >>> incorrect config out ASAP. If we want to do this gracefully, I
>> >> > >>> don't feel that removing it in Spark 4.0.0 gives users enough
>> >> > >>> time to learn that the config is deprecated and will be removed
>> >> > >>> very soon.
>> >> > >>>
>> >> > >>> On Wed, Feb 19, 2025 at 2:24 AM Holden Karau <
>> holden.ka...@gmail.com>
>> >> > >>> wrote:
>> >> > >>>
>> >> > >>>> I think that removing in 4 sounds reasonable to me as well. It’s
>> >> > >>>> important to create a sense of fairness among vendors.
>> >> > >>>>
>> >> > >>>> Twitter: https://twitter.com/holdenkarau
>> >> > >>>> Fight Health Insurance: https://www.fighthealthinsurance.com/
>> >> > >>>> <https://www.fighthealthinsurance.com/?q=hk_email>
>> >> > >>>> Books (Learning Spark, High Performance Spark, etc.):
>> >> > >>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>> >> > >>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>> >> > >>>> Pronouns: she/her
>> >> > >>>>
>> >> > >>>>
>> >> > >>>> On Tue, Feb 18, 2025 at 11:22 AM Dongjoon Hyun <
>> dongjoon.h...@gmail.com>
>> >> > >>>> wrote:
>> >> > >>>>
>> >> > >>>>> I don't think there is a reason to keep it in 4.0.0 (and
>> >> > >>>>> forever?) if we release Spark 3.5.5 with the proper deprecation.
>> >> > >>>>> This is a big difference, Wenchen.
>> >> > >>>>>
>> >> > >>>>> And that difference is the main reason why I initiated this
>> >> > >>>>> thread: to suggest removing 'spark.databricks.*' completely from
>> >> > >>>>> Apache Spark 4, while volunteering as the Spark 3.5.5 release
>> >> > >>>>> manager.
>> >> > >>>>>
>> >> > >>>>> Sincerely,
>> >> > >>>>> Dongjoon
>> >> > >>>>>
>> >> > >>>>>
>> >> > >>>>> On Mon, Feb 17, 2025 at 22:59 Wenchen Fan <cloud0...@gmail.com>
>> wrote:
>> >> > >>>>>
>> >> > >>>>>> It’s unfortunate that we missed identifying these issues
>> during the
>> >> > >>>>>> code review. However, since they have already been released,
>> I believe
>> >> > >>>>>> deprecating them is a better approach than removing them, as
>> the latter
>> >> > >>>>>> would introduce a breaking change.
>> >> > >>>>>>
>> >> > >>>>>> Regarding Jungtaek’s PR <
>> https://github.com/apache/spark/pull/49983>,
>> >> > >>>>>> it looks like there are only a few lines of migration code.
>> Would it be
>> >> > >>>>>> acceptable to leave them for legacy support? With the new
>> config name style
>> >> > >>>>>> check rule in place, such issues should not occur again in
>> the future.
>> >> > >>>>>>
>> >> > >>>>>> On Tue, Feb 18, 2025 at 9:00 AM Jungtaek Lim <
>> >> > >>>>>> kabhwan.opensou...@gmail.com> wrote:
>> >> > >>>>>>
>> >> > >>>>>>> I think I can add some color to minimize the concern.
>> >> > >>>>>>>
>> >> > >>>>>>> The problematic config we added is arguably not user-facing.
>> >> > >>>>>>> I'd argue most users wouldn't even understand what the flag
>> >> > >>>>>>> does. The config was added because Structured Streaming
>> >> > >>>>>>> leverages SQL config to "do the magic" of having two different
>> >> > >>>>>>> default values for new queries vs. old queries (i.e.,
>> >> > >>>>>>> checkpoints created on a version where the fix had not yet
>> >> > >>>>>>> landed). It is purely for backward compatibility, not
>> >> > >>>>>>> something we want to give users flexibility over.
>> >> > >>>>>>>
>> >> > >>>>>>> That said, I don't see a risk in removing the config "at any
>> >> > >>>>>>> point". (I'd even say removing this config in Spark 3.5.5
>> >> > >>>>>>> would not change anything. The reason I'm not removing the
>> >> > >>>>>>> config in 3.5, while removing it in 4.0/master, is just to be
>> >> > >>>>>>> conservative and address any concerns.)
>> >> > >>>>>>>
>> >> > >>>>>>> I think you are worrying about case 1 from my comment. In my
>> >> > >>>>>>> new change (link <https://github.com/apache/spark/pull/49983>),
>> >> > >>>>>>> I added migration logic for when the offset log contains the
>> >> > >>>>>>> problematic configuration: we take the value but store it
>> >> > >>>>>>> under the new config, so at the next microbatch planning the
>> >> > >>>>>>> offset log will contain the new configuration going forward.
>> >> > >>>>>>> This addresses case 1, as long as we retain the migration
>> >> > >>>>>>> logic for a couple of minor releases (say, until 4.2 or so),
>> >> > >>>>>>> long enough that jumping directly from Spark 3.5.4 to that
>> >> > >>>>>>> version is no longer a realistic scenario.
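A minimal sketch of that offset-log migration step, with the real key names but a plain `Map` standing in for the actual Structured Streaming offset-log entry:

```scala
object OffsetLogMigration {
  val LegacyKey =
    "spark.databricks.sql.optimizer.pruneFiltersCanPruneStreamingSubplan"
  val NewKey = "spark.sql.optimizer.pruneFiltersCanPruneStreamingSubplan"

  // If the confs restored from the offset log carry the legacy key, move its
  // value under the new key so the next microbatch persists only the new one.
  def migrate(confs: Map[String, String]): Map[String, String] =
    confs.get(LegacyKey) match {
      case Some(v) => (confs - LegacyKey) + (NewKey -> v)
      case None    => confs
    }
}
```

The value set by the old version is preserved exactly, so the flipped-default hazard described in the linked PR comment does not arise for restarted queries.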
>> >> > >>>>>>>
>> >> > >>>>>>> Hope this helps to address your concerns.
>> >> > >>>>>>>
>> >> > >>>>>>>
>> >> > >>>>>>> On Tue, Feb 18, 2025 at 7:40 AM Bjørn Jørgensen <
>> >> > >>>>>>> bjornjorgen...@gmail.com> wrote:
>> >> > >>>>>>>
>> >> > >>>>>>>>
>> >> > >>>>>>>> Having breaking changes in a minor release does not seem
>> >> > >>>>>>>> good. As I read this:
>> >> > >>>>>>>>
>> >> > >>>>>>>> "*This could break the query if the rule impacts the query,
>> >> > >>>>>>>> because the effectiveness of the fix is flipped.*"
>> >> > >>>>>>>>
>> >> > >>>>>>>> https://github.com/apache/spark/pull/49897#issuecomment-2652567140
>> >> > >>>>>>>>
>> >> > >>>>>>>>
>> >> > >>>>>>>> What if we keep this https://github.com/apache/spark/pull/48149
>> >> > >>>>>>>> change in the branch and remove it only in version 4? That
>> >> > >>>>>>>> way we don't break anything.
>> >> > >>>>>>>>
>> >> > >>>>>>>>
>> >> > >>>>>>>>
>> >> > >>>>>>>>
>> >> > >>>>>>>> man. 17. feb. 2025 kl. 23:03 skrev Dongjoon Hyun <
>> >> > >>>>>>>> dongjoon.h...@gmail.com>:
>> >> > >>>>>>>>
>> >> > >>>>>>>>> Hi, All.
>> >> > >>>>>>>>>
>> >> > >>>>>>>>> I'd like to highlight this discussion because it is both
>> >> > >>>>>>>>> important and tricky in its own way.
>> >> > >>>>>>>>>
>> >> > >>>>>>>>> As already mentioned in the mailing list and PRs, there was
>> >> > >>>>>>>>> an obvious mistake that let an improper configuration name,
>> >> > >>>>>>>>> `spark.databricks.*`, slip through.
>> >> > >>>>>>>>>
>> >> > >>>>>>>>>
>> >> > >>>>>>>>>
>> https://github.com/apache/spark/blob/a6f220d951742f4074b37772485ee0ec7a774e7d/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L3424
>> >> > >>>>>>>>>
>> >> > >>>>>>>>>
>> >> > >>>>>>>>>
>> `spark.databricks.sql.optimizer.pruneFiltersCanPruneStreamingSubplan`
>> >> > >>>>>>>>>
>> >> > >>>>>>>>> In fact, Apache Spark committers had successfully prevented
>> >> > >>>>>>>>> this repetitive mistake pattern during review, until the
>> >> > >>>>>>>>> following backports slipped through in Apache Spark 3.5.4.
>> >> > >>>>>>>>>
>> >> > >>>>>>>>> https://github.com/apache/spark/pull/45649
>> >> > >>>>>>>>> https://github.com/apache/spark/pull/48149
>> >> > >>>>>>>>> https://github.com/apache/spark/pull/49121
>> >> > >>>>>>>>>
>> >> > >>>>>>>>> At the time of writing, `spark.databricks.*` has been
>> >> > >>>>>>>>> removed from `master` and `branch-4.0`, and a new Scalastyle
>> >> > >>>>>>>>> rule was added to protect the Apache Spark repository from
>> >> > >>>>>>>>> future mistakes.
>> >> > >>>>>>>>>
>> >> > >>>>>>>>> SPARK-51172 Rename to
>> >> > >>>>>>>>> spark.sql.optimizer.pruneFiltersCanPruneStreamingSubplan
>> >> > >>>>>>>>> SPARK-51173 Add `configName` Scalastyle rule
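The essence of such a check, reduced to a hypothetical predicate over newly added config keys (a simplification for illustration; the actual `configName` Scalastyle rule's pattern may differ):

```scala
object ConfigNameCheck {
  // Hypothetical simplification: a valid key must live under "spark." but
  // must not use a vendor namespace such as "spark.databricks.".
  private val VendorPrefixes = Seq("spark.databricks.")

  def isValid(key: String): Boolean =
    key.startsWith("spark.") && !VendorPrefixes.exists(p => key.startsWith(p))
}
```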
>> >> > >>>>>>>>>
>> >> > >>>>>>>>> What I propose is to release Apache Spark 3.5.5 next week
>> >> > >>>>>>>>> with the deprecation, in order to make Apache Spark 4.0 free
>> >> > >>>>>>>>> of the `spark.databricks.*` configuration.
>> >> > >>>>>>>>>
>> >> > >>>>>>>>> Apache Spark 3.5.5 (2025 February, with deprecation
>> warning with
>> >> > >>>>>>>>> alternative)
>> >> > >>>>>>>>> Apache Spark 4.0.0 (2025 March, without
>> `spark.databricks.*`
>> >> > >>>>>>>>> config)
>> >> > >>>>>>>>>
>> >> > >>>>>>>>> In addition, I'd like to volunteer as the release manager of
>> >> > >>>>>>>>> Apache Spark 3.5.5 for a swift release. WDYT?
>> >> > >>>>>>>>>
>> >> > >>>>>>>>> FYI, `branch-3.5` has 37 patches currently.
>> >> > >>>>>>>>>
>> >> > >>>>>>>>> $ git log --oneline v3.5.4..HEAD | wc -l
>> >> > >>>>>>>>>       37
>> >> > >>>>>>>>>
>> >> > >>>>>>>>> Best Regards,
>> >> > >>>>>>>>> Dongjoon.
>> >> > >>>>>>>>>
>> >> > >>>>>>>>
>> >> > >>>>>>>>
>> >> > >>>>>>>> --
>> >> > >>>>>>>> Bjørn Jørgensen
>> >> > >>>>>>>> Vestre Aspehaug 4, 6010 Ålesund
>> >> > >>>>>>>> Norge
>> >> > >>>>>>>>
>> >> > >>>>>>>> +47 480 94 297
>> >> > >>>>>>>>
>> >> > >>>>>>>
>> >> >
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >>
