Hi all, I’d like to emphasize that a major version release is not a justification for unnecessary breaking changes. If we are confident that no one is using this configuration, we should clean it up in 3.5.5 as well. However, if there’s a possibility that users are already relying on it, then the legacy support should stay in 4.0 as well. We can remove the legacy support after a reasonable period; the exact timeline is not yet defined, but the window between 3.5.5 and 4.0.0 likely isn’t long enough.
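For concreteness, the legacy support in question (Jungtaek's PR, linked later in the thread) boils down to an offset-log key rename. A minimal sketch of the idea — the object and method names here are illustrative, not the actual PR code:

    // Sketch only: when a restored streaming query's offset log still records
    // the wrongly named key, carry its value over to the correctly named key.
    // The next microbatch persists only the new key, so the legacy name ages
    // out of checkpoints on its own.
    object OffsetLogConfMigration {
      val LegacyKey =
        "spark.databricks.sql.optimizer.pruneFiltersCanPruneStreamingSubplan"
      val CorrectKey =
        "spark.sql.optimizer.pruneFiltersCanPruneStreamingSubplan"

      def migrate(confs: Map[String, String]): Map[String, String] =
        confs.get(LegacyKey) match {
          case Some(v) => confs - LegacyKey + (CorrectKey -> v)
          case None    => confs
        }
    }

Because the rename happens at checkpoint restore time, even a direct 3.5.4 -> 4.0.0 upgrade would be handled, provided the logic is retained long enough.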
Personally, I think we should keep the legacy support in 4.0, as it's just a few lines of code. But if no one uses this conf and you have a strong opinion on this cleanup, let's remove the wrongly named conf in both 3.5.5 and 4.0.0.

On Wed, Feb 19, 2025 at 9:03 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Depends how you want to play this. As usual, a cost/benefit analysis will be useful.
>
> *Immediate Removal in Spark 3.5.5:*
> pros: Quickly removes the problematic configuration, reducing technical debt and potential issues.
> cons: Users upgrading directly from earlier versions to Spark 3.5.5 or later will face immediate breakage without a deprecation period.
>
> *Deprecation in Spark 3.5.5 and Removal in Spark 4.0.0:*
> pros: Provides a clear deprecation period, allowing users to migrate their configurations. Reduces the risk of breakage for users who follow recommended upgrade paths.
> cons: Users who skip versions might still face issues. The deprecation period might be seen as too short if Spark 4.0.0 is released soon after Spark 3.5.5, which it most probably will be.
>
> *Extended Deprecation Period:*
> pros: Provides a longer migration window, reducing the risk of breakage for users who jump versions.
> cons: Delays the removal of the configuration, potentially prolonging technical challenges and issues related to the deprecated configuration.
>
> *My take*
>
> 1. Deprecate in Spark 3.5.5: Introduce deprecation warnings in Spark 3.5.5 stating that the spark.databricks.* configuration will be removed in Spark 4.0.0.
> 2. Remove in Spark 4.0.0: Remove the configuration in Spark 4.0.0, providing a clear upgrade path and sufficient notice for users to migrate.
> 3. User Communication: Ensure that deprecation warnings are prominent in the documentation, release notes, and runtime logs. Recommend upgrading through intermediate versions to avoid breakage.
>
> HTH
>
> Dr Mich Talebzadeh,
> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>
> view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
> On Wed, 19 Feb 2025 at 00:41, Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>
>> Though, if we are OK with making users read the migration guide to figure out the change in the case of a direct upgrade to Spark 4.0.0+, I agree this is also one of the valid ways.
>>
>> On Wed, Feb 19, 2025 at 9:20 AM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>>
>>> The point is, why can't we remove it from Spark 3.5.5 as well, if we are planning to "remove" (not deprecate) it at the very next minor release?
>>>
>>> The migration logic just works without the incorrect key ever having to surface as a SQL config.
>>>
>>> That said, the point we are debating here is only valid if we want to let users keep setting the value of the config manually for some time. I'd argue users would never set this manually, but let's put that aside for now.
>>>
>>> What is the proper "some time" here? If we deprecate this in Spark 3.5.5 and remove it in Spark 4.0.0, we are going to break the case of setting the config manually for anyone upgrading from Spark 3.5.4 to Spark 4.0.0 directly. I have no idea how we would construct the message to recommend upgrading from Spark 3.5.4 to Spark 3.5.5, but if it's not strong enough, there will always be users who jump directly across major/minor versions, and for them we effectively do not deprecate the config at all.
>>> I doubt there would be a huge difference if we just removed it in Spark 3.5.5 and only supported the migration case. That has been my claim, if we really want to kick the incorrect config out ASAP. If we want to do this gracefully, I don't feel that removing it in Spark 4.0.0 gives users enough notice that the config is deprecated and will be removed very soon.
>>>
>>> On Wed, Feb 19, 2025 at 2:24 AM Holden Karau <holden.ka...@gmail.com> wrote:
>>>
>>>> I think that removing in 4 sounds reasonable to me as well. It's important to create a sense of fairness among vendors.
>>>>
>>>> Twitter: https://twitter.com/holdenkarau
>>>> Fight Health Insurance: https://www.fighthealthinsurance.com/
>>>> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>> Pronouns: she/her
>>>>
>>>> On Tue, Feb 18, 2025 at 11:22 AM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>>
>>>>> I don't think there is a reason to keep it in 4.0.0 (and forever?) if we release Spark 3.5.5 with the proper deprecation. This is a big difference, Wenchen.
>>>>>
>>>>> And that difference is the main reason why I initiated this thread: to suggest removing 'spark.databricks.*' completely from Apache Spark 4, by volunteering as the Spark 3.5.5 release manager.
>>>>>
>>>>> Sincerely,
>>>>> Dongjoon
>>>>>
>>>>> On Mon, Feb 17, 2025 at 22:59 Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>>
>>>>>> It's unfortunate that we missed identifying these issues during code review. However, since they have already been released, I believe deprecating them is a better approach than removing them, as the latter would introduce a breaking change.
>>>>>>
>>>>>> Regarding Jungtaek's PR <https://github.com/apache/spark/pull/49983>, it looks like there are only a few lines of migration code. Would it be acceptable to leave them in for legacy support? With the new config name style check rule in place, such issues should not occur again in the future.
>>>>>>
>>>>>> On Tue, Feb 18, 2025 at 9:00 AM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>>>>>>
>>>>>>> I think I can add some color to minimize the concern.
>>>>>>>
>>>>>>> The problematic config we added is arguably not user-facing. I'd argue most users wouldn't even understand what the flag does. The config was added because Structured Streaming has been leveraging SQL configs to "do the magic" of having two different default values for new queries vs. old queries (those whose checkpoint was created by a version where the fix had not landed). This is purely for backward compatibility, not something we want to give users flexibility over.
>>>>>>>
>>>>>>> That said, I don't see a risk in removing the config "at any point". (I'd even say removing this config in Spark 3.5.5 does not change anything. The reason I'm not removing the config in 3.5, and not yet in 4.0/master, is just to address any concern about being conservative.)
>>>>>>>
>>>>>>> I think you are worrying about case 1 from my comment.
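>>>>>>> (To make the "magic" above concrete, a rough sketch — the names are illustrative, not the actual implementation: a query restarted from an existing checkpoint keeps the value its offset log recorded, while only a brand-new query picks up the current default.)
>>>>>>>
>>>>>>>     object StreamingConfCompat {
>>>>>>>       val ConfKey = "spark.sql.optimizer.pruneFiltersCanPruneStreamingSubplan"
>>>>>>>
>>>>>>>       // A query restarted from a checkpoint keeps whatever value its
>>>>>>>       // offset log recorded; only a brand-new query (no recorded value)
>>>>>>>       // falls back to the current default.
>>>>>>>       def effectiveValue(offsetLogConfs: Map[String, String],
>>>>>>>                          defaultForNewQueries: Boolean): Boolean =
>>>>>>>         offsetLogConfs.get(ConfKey).map(_.toBoolean).getOrElse(defaultForNewQueries)
>>>>>>>     }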
>>>>>>> In my new change (link <https://github.com/apache/spark/pull/49983>), I added migration logic for the case where the offset log contains the problematic configuration: we take its value but store it under the new config, so at the next microbatch planning the offset log will contain the new configuration going forward. This addresses case 1, as long as we retain the migration logic for a couple of minor releases (say, until 4.2 or so). We just need to keep the migration logic for as long as someone might plausibly jump directly from Spark 3.5.4 to that version.
>>>>>>>
>>>>>>> Hope this helps address your concerns.
>>>>>>>
>>>>>>> On Tue, Feb 18, 2025 at 7:40 AM Bjørn Jørgensen <bjornjorgen...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Having breaking changes in a minor release doesn't seem that good. As I read this,
>>>>>>>>
>>>>>>>> "*This could break the query if the rule impacts the query, because the effectiveness of the fix is flipped.*"
>>>>>>>> https://github.com/apache/spark/pull/49897#issuecomment-2652567140
>>>>>>>>
>>>>>>>> What if we keep the https://github.com/apache/spark/pull/48149 change in the branch and remove it only in version 4? That way we don't break anything.
>>>>>>>>
>>>>>>>> On Mon, 17 Feb 2025 at 23:03, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi, All.
>>>>>>>>>
>>>>>>>>> I'd like to highlight this discussion because it is more important, and trickier, in a way.
>>>>>>>>>
>>>>>>>>> As already mentioned in the mailing list and PRs, there was an obvious mistake that let an improperly named configuration, `spark.databricks.*`, slip through:
>>>>>>>>>
>>>>>>>>> https://github.com/apache/spark/blob/a6f220d951742f4074b37772485ee0ec7a774e7d/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L3424
>>>>>>>>>
>>>>>>>>> `spark.databricks.sql.optimizer.pruneFiltersCanPruneStreamingSubplan`
>>>>>>>>>
>>>>>>>>> In fact, Apache Spark committers had been successfully preventing this repetitive mistake pattern during review, until the following backports slipped into Apache Spark 3.5.4:
>>>>>>>>>
>>>>>>>>> https://github.com/apache/spark/pull/45649
>>>>>>>>> https://github.com/apache/spark/pull/48149
>>>>>>>>> https://github.com/apache/spark/pull/49121
>>>>>>>>>
>>>>>>>>> At the time of writing, `spark.databricks.*` has been removed successfully from `master` and `branch-4.0`, and a new Scalastyle rule has been added to protect the Apache Spark repository from future mistakes:
>>>>>>>>>
>>>>>>>>> SPARK-51172 Rename to spark.sql.optimizer.pruneFiltersCanPruneStreamingSubplan
>>>>>>>>> SPARK-51173 Add `configName` Scalastyle rule
>>>>>>>>>
>>>>>>>>> What I propose is to release Apache Spark 3.5.5 next week with the deprecation, in order to make Apache Spark 4.0 free of the `spark.databricks.*` configuration:
>>>>>>>>>
>>>>>>>>> Apache Spark 3.5.5 (February 2025, with a deprecation warning and an alternative)
>>>>>>>>> Apache Spark 4.0.0 (March 2025, without the `spark.databricks.*` config)
>>>>>>>>>
>>>>>>>>> In addition, I'd like to volunteer as the release manager of Apache Spark 3.5.5 for a swift release. WDYT?
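>>>>>>>>> (For reference, the `configName` rule is in the spirit of the following sketch — illustrative only, not the actual Scalastyle rule; see SPARK-51173 for the real thing:)
>>>>>>>>>
>>>>>>>>>     object ConfigNameCheck {
>>>>>>>>>       // Flag any buildConf key that is outside the "spark." namespace
>>>>>>>>>       // or that carries a vendor prefix such as "spark.databricks.".
>>>>>>>>>       private val ConfKey = """buildConf\("([^"]+)"\)""".r
>>>>>>>>>
>>>>>>>>>       // Returns the offending key, if any, found on a source line.
>>>>>>>>>       def violation(line: String): Option[String] =
>>>>>>>>>         ConfKey.findFirstMatchIn(line).map(_.group(1)).filterNot { key =>
>>>>>>>>>           key.startsWith("spark.") && !key.startsWith("spark.databricks.")
>>>>>>>>>         }
>>>>>>>>>     }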
>>>>>>>>> FYI, `branch-3.5` has 37 patches currently.
>>>>>>>>>
>>>>>>>>>     $ git log --oneline v3.5.4..HEAD | wc -l
>>>>>>>>>     37
>>>>>>>>>
>>>>>>>>> Best Regards,
>>>>>>>>> Dongjoon.
>>>>>>>>
>>>>>>>> --
>>>>>>>> Bjørn Jørgensen
>>>>>>>> Vestre Aspehaug 4, 6010 Ålesund, Norge
>>>>>>>> +47 480 94 297