Depends how you want to play this. As usual a cost/benefit analysis will be
useful

*Immediate Removal in Spark 3.5.5*:
pros: Quickly removes the problematic configuration, reducing technical
debt and potential issues.
cons: Users upgrading directly from earlier versions to Spark 3.5.5 or
later will face immediate breakage without a deprecation period.

*Deprecation in Spark 3.5.5 and Removal in Spark 4.0.0:*
pros: Provides a clear deprecation period, allowing users to migrate their
configurations. Reduces the risk of breakage for users who follow
recommended upgrade paths.
cons: Users who skip versions might still face issues. The depreciation
period might be seen as too short if Spark 4.0.0 is released soon after
Spark 3.5.5, most probably

*Extended Deprecation Period:*
pros: Provides a longer migration window, reducing the risk of breakage for
users who jump versions.
cons: Delays the removal of the configuration, potentially prolonging
technical challenges and issues related to the deprecated configuration.

*My take*

   1. Deprecate in Spark 3.5.5: Introduce deprecation warnings in Spark
   3.5.5  that the spark.databricks.* configuration will be removed in Spark
   4.0.0.
   2. Remove in Spark 4.0.0: Remove the configuration in Spark 4.0.0,
   providing a clear upgrade path and sufficient notice for users to migrate.
   3. User Communication: Ensure that deprecation warnings are prominent in
   the documentation, release notes, and runtime logs. Recommend upgrading
   through intermediate versions to avoid breakage.

HTH

Dr Mich Talebzadeh,
Architect | Data Science | Financial Crime | Forensic Analysis | GDPR

   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>





On Wed, 19 Feb 2025 at 00:41, Jungtaek Lim <kabhwan.opensou...@gmail.com>
wrote:

> Though if we are OK with disturbing users to read the migration guide to
> figure out the change for the case of direct upgrade to Spark 4.0.0+, I
> agree this is also one of the valid ways.
>
> On Wed, Feb 19, 2025 at 9:20 AM Jungtaek Lim <kabhwan.opensou...@gmail.com>
> wrote:
>
>> The point is, why can't we remove it from Spark 3.5.5 as well if we are
>> planning to "remove" (not deprecate) at the very next minor release?
>>
>> The logic of migration just works without having the incorrect config key
>> to be indicated with SQL config key.
>>
>> That said, the point we debate here is only valid when we want to let
>> users keep setting the value of the config manually for some time. I'd
>> argue users would never set this manually, but let's put this aside for now.
>>
>> What is the proper "some time" here? If we deprecate this in Spark 3.5.5
>> and remove it to Spark 4.0.0, we are going to break the case of setting the
>> config manually when they are upgrading Spark 3.5.4 to Spark 4.0.0
>> directly. I have no idea how we construct the message to recommend
>> upgrading from Spark 3.5.4 to Spark 3.5.5, but if it's not strong enough,
>> there will always be a case who directly jumps major/minor versions, and
>> for them, effectively we do not deprecate the config very well.
>>
>> I suspect there would be a huge difference if we just remove it in Spark
>> 3.5.5 and only support the migration case. That has been something I
>> claimed if we really want to kick the incorrect config out ASAP. If we want
>> to do this gracefully, I don't feel like removing this in Spark 4.0.0 is
>> giving users enough time to indicate the config is deprecated and will be
>> removed very soon.
>>
>> On Wed, Feb 19, 2025 at 2:24 AM Holden Karau <holden.ka...@gmail.com>
>> wrote:
>>
>>> I think that removing in 4 sounds reasonable to me as well. It’s
>>> important to create a sense of fairness among vendors.
>>>
>>> Twitter: https://twitter.com/holdenkarau
>>> Fight Health Insurance: https://www.fighthealthinsurance.com/
>>> <https://www.fighthealthinsurance.com/?q=hk_email>
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>> Pronouns: she/her
>>>
>>>
>>> On Tue, Feb 18, 2025 at 11:22 AM Dongjoon Hyun <dongjoon.h...@gmail.com>
>>> wrote:
>>>
>>>> I don't think there is a reason to keep it at 4.0.0 (and forever?) if
>>>> we release Spark 3.5.5 with the proper deprecation. This is a big
>>>> difference, Wenchen.
>>>>
>>>> And, the difference is the main reason why I initiated this thread to
>>>> sugguest to remove 'spark.databricks.*' completely from Apache Spark 4 via
>>>> volunteering Spark 3.5.5 release manager.
>>>>
>>>> Sincerely,
>>>> Dongjoon
>>>>
>>>>
>>>> On Mon, Feb 17, 2025 at 22:59 Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>
>>>>> It’s unfortunate that we missed identifying these issues during the
>>>>> code review. However, since they have already been released, I believe
>>>>> deprecating them is a better approach than removing them, as the latter
>>>>> would introduce a breaking change.
>>>>>
>>>>> Regarding Jungtaek’s PR <https://github.com/apache/spark/pull/49983>,
>>>>> it looks like there are only a few lines of migration code. Would it be
>>>>> acceptable to leave them for legacy support? With the new config name 
>>>>> style
>>>>> check rule in place, such issues should not occur again in the future.
>>>>>
>>>>> On Tue, Feb 18, 2025 at 9:00 AM Jungtaek Lim <
>>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>>
>>>>>> I think I can add a color to minimize the concern.
>>>>>>
>>>>>> The problematic config we added is arguably not user facing. I'd
>>>>>> argue moderate users wouldn't even understand what the flag is doing. The
>>>>>> config was added because Structured Streaming has been leveraging SQL
>>>>>> config to "do the magic" on having two different default values for new
>>>>>> query vs old query (checkpoint is created from the version where the fix 
>>>>>> is
>>>>>> not landed). This is purely used for backward compatibility, not 
>>>>>> something
>>>>>> we want to give users flexibility.
>>>>>>
>>>>>> That said, I don't see a risk of removing config "at any point". (I'd
>>>>>> even say removing this config in Spark 3.5.5 does not change anything. 
>>>>>> The
>>>>>> reason I'm not removing the config in 3.5 (and yet to 4.0/master) is just
>>>>>> to address any concern on being conservative.)
>>>>>>
>>>>>> I think you are worrying about case 1 from my comment. From my new
>>>>>> change (link <https://github.com/apache/spark/pull/49983>), I made a
>>>>>> migration logic when the offset log contains the problematic 
>>>>>> configuration
>>>>>> - we will take the value, but put the value to the new config, and at the
>>>>>> next microbatch planning, the offset log will contain the new 
>>>>>> configuration
>>>>>> going forward. This addresses the case 1, as long as we retain the
>>>>>> migration logic for a couple minor releases (say, 4.2 or so). We just 
>>>>>> need
>>>>>> to support this migration logic for the time where we never thought of
>>>>>> jumping directly from Spark 3.5.4 to the version.
>>>>>>
>>>>>> Hope this helps to address your concern/worrying.
>>>>>>
>>>>>>
>>>>>> On Tue, Feb 18, 2025 at 7:40 AM Bjørn Jørgensen <
>>>>>> bjornjorgen...@gmail.com> wrote:
>>>>>>
>>>>>>>
>>>>>>> Having breaking changes in a minor seems not that good.. As I'm
>>>>>>> reading this,
>>>>>>>
>>>>>>> "*This could break the query if the rule impacts the query, because
>>>>>>> the effectiveness of the fix is flipped.*"
>>>>>>> https://github.com/apache/spark/pull/49897#issuecomment-2652567140
>>>>>>>
>>>>>>>
>>>>>>> What if we have this https://github.com/apache/spark/pull/48149
>>>>>>> change in the branch and remove it only for version 4? That way we dont
>>>>>>> break anything.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> man. 17. feb. 2025 kl. 23:03 skrev Dongjoon Hyun <
>>>>>>> dongjoon.h...@gmail.com>:
>>>>>>>
>>>>>>>> Hi, All.
>>>>>>>>
>>>>>>>> I'd like to highlight this discussion because this is more
>>>>>>>> important and tricky in a way.
>>>>>>>>
>>>>>>>> As already mentioned in the mailing list and PRs, there was an
>>>>>>>> obvious mistake
>>>>>>>> which missed an improper configuration name, `spark.databricks.*`.
>>>>>>>>
>>>>>>>>
>>>>>>>> https://github.com/apache/spark/blob/a6f220d951742f4074b37772485ee0ec7a774e7d/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L3424
>>>>>>>>
>>>>>>>>
>>>>>>>> `spark.databricks.sql.optimizer.pruneFiltersCanPruneStreamingSubplan`
>>>>>>>>
>>>>>>>> In fact, Apache Spark committers have been preventing this
>>>>>>>> repetitive mistake
>>>>>>>> pattern during the review stages successfully until we slip the
>>>>>>>> following backportings
>>>>>>>> at Apache Spark 3.5.4.
>>>>>>>>
>>>>>>>> https://github.com/apache/spark/pull/45649
>>>>>>>> https://github.com/apache/spark/pull/48149
>>>>>>>> https://github.com/apache/spark/pull/49121
>>>>>>>>
>>>>>>>> At this point of writing, `spark.databricks.*` was removed
>>>>>>>> successfully from `master`
>>>>>>>> and `branch-4.0` and a new ScalaStyle rule was added to protect
>>>>>>>> Apache Spark repository
>>>>>>>> from future mistakes.
>>>>>>>>
>>>>>>>> SPARK-51172 Rename to
>>>>>>>> spark.sql.optimizer.pruneFiltersCanPruneStreamingSubplan
>>>>>>>> SPARK-51173 Add `configName` Scalastyle rule
>>>>>>>>
>>>>>>>> What I proposed is to release Apache Spark 3.5.5 next week with the
>>>>>>>> deprecation
>>>>>>>> in order to make Apache Spark 4.0 be free of `spark.databricks.*`
>>>>>>>> configuration.
>>>>>>>>
>>>>>>>> Apache Spark 3.5.5 (2025 February, with deprecation warning with
>>>>>>>> alternative)
>>>>>>>> Apache Spark 4.0.0 (2025 March, without `spark.databricks.*` config)
>>>>>>>>
>>>>>>>> In addition, I'd like to volunteer as a release manager of Apache
>>>>>>>> Spark 3.5.5
>>>>>>>> for a swift release. WDYT?
>>>>>>>>
>>>>>>>> FYI, `branch-3.5` has 37 patches currently.
>>>>>>>>
>>>>>>>> $ git log --oneline v3.5.4..HEAD | wc -l
>>>>>>>>       37
>>>>>>>>
>>>>>>>> Best Regards,
>>>>>>>> Dongjoon.
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Bjørn Jørgensen
>>>>>>> Vestre Aspehaug 4, 6010 Ålesund
>>>>>>> <https://www.google.com/maps/search/Vestre+Aspehaug+4,+6010+%C3%85lesund++Norge?entry=gmail&source=g>
>>>>>>> Norge
>>>>>>> <https://www.google.com/maps/search/Vestre+Aspehaug+4,+6010+%C3%85lesund++Norge?entry=gmail&source=g>
>>>>>>>
>>>>>>> +47 480 94 297
>>>>>>>
>>>>>>

Reply via email to