The issue is not how many lines of code it is, but rather how serious of an
issue it is to have the databricks namespace in Apache code. It's not
a large functional issue, but that doesn't mean that it is only a minor
issue, nor do I think that I would characterize the removal of this
clear error as an unnecessary change. While we strive to avoid making
breaking changes without first giving lengthy notification, that does not
mean that we are prohibited from making a breaking change on short notice
across a major version change. I believe that we would certainly do that
for a significant security or data correctness issue, and I would argue
that the current error is approaching that level of significance. With the
expectation that users shouldn't be using this configuration, the end user
impact of cleaning up our mistake quickly should not be large, so I'm
leaning toward advocating for not carrying the error over into 4.0.0.


On Tue, Feb 18, 2025 at 7:36 PM Wenchen Fan <cloud0...@gmail.com> wrote:

> Hi all,
>
> I’d like to emphasize that a major version release is not a justification
> for unnecessary breaking changes. If we are confident that no one is using
> this configuration, we should clean it up in 3.5.5 as well. However, if
> there’s a possibility that users are already relying on it, then legacy
> support should be in 4.0 as well. We can remove the legacy support after a
> reasonable period, though the exact timeline is not yet defined, but the
> period between 3.5.5 and 4.0.0 likely isn’t long enough.
>
> Personally I think we should keep the legacy support in 4.0 as it's just a
> few lines of code, but if no one uses this conf and you have a strong
> opinion on this cleanup, let's remove the wrongly named conf in both 3.5.5
> and 4.0.0.
>
> On Wed, Feb 19, 2025 at 9:03 AM Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
>> Depends how you want to play this. As usual a cost/benefit analysis will
>> be useful
>>
>> *Immediate Removal in Spark 3.5.5*:
>> pros: Quickly removes the problematic configuration, reducing technical
>> debt and potential issues.
>> cons: Users upgrading directly from earlier versions to Spark 3.5.5 or
>> later will face immediate breakage without a deprecation period.
>>
>> *Deprecation in Spark 3.5.5 and Removal in Spark 4.0.0:*
>> pros: Provides a clear deprecation period, allowing users to migrate
>> their configurations. Reduces the risk of breakage for users who follow
>> recommended upgrade paths.
>> cons: Users who skip versions might still face issues. The depreciation
>> period might be seen as too short if Spark 4.0.0 is released soon after
>> Spark 3.5.5, most probably
>>
>> *Extended Deprecation Period:*
>> pros: Provides a longer migration window, reducing the risk of breakage
>> for users who jump versions.
>> cons: Delays the removal of the configuration, potentially prolonging
>> technical challenges and issues related to the deprecated configuration.
>>
>> *My take*
>>
>>    1. Deprecate in Spark 3.5.5: Introduce deprecation warnings in Spark
>>    3.5.5  that the spark.databricks.* configuration will be removed in Spark
>>    4.0.0.
>>    2. Remove in Spark 4.0.0: Remove the configuration in Spark 4.0.0,
>>    providing a clear upgrade path and sufficient notice for users to migrate.
>>    3. User Communication: Ensure that deprecation warnings are prominent
>>    in the documentation, release notes, and runtime logs. Recommend upgrading
>>    through intermediate versions to avoid breakage.
>>
>> HTH
>>
>> Dr Mich Talebzadeh,
>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>>
>>    view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>
>>
>>
>> On Wed, 19 Feb 2025 at 00:41, Jungtaek Lim <kabhwan.opensou...@gmail.com>
>> wrote:
>>
>>> Though if we are OK with disturbing users to read the migration guide to
>>> figure out the change for the case of direct upgrade to Spark 4.0.0+, I
>>> agree this is also one of the valid ways.
>>>
>>> On Wed, Feb 19, 2025 at 9:20 AM Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>>
>>>> The point is, why can't we remove it from Spark 3.5.5 as well if we are
>>>> planning to "remove" (not deprecate) at the very next minor release?
>>>>
>>>> The logic of migration just works without having the incorrect config
>>>> key to be indicated with SQL config key.
>>>>
>>>> That said, the point we debate here is only valid when we want to let
>>>> users keep setting the value of the config manually for some time. I'd
>>>> argue users would never set this manually, but let's put this aside for 
>>>> now.
>>>>
>>>> What is the proper "some time" here? If we deprecate this in Spark
>>>> 3.5.5 and remove it to Spark 4.0.0, we are going to break the case of
>>>> setting the config manually when they are upgrading Spark 3.5.4 to Spark
>>>> 4.0.0 directly. I have no idea how we construct the message to recommend
>>>> upgrading from Spark 3.5.4 to Spark 3.5.5, but if it's not strong enough,
>>>> there will always be a case who directly jumps major/minor versions, and
>>>> for them, effectively we do not deprecate the config very well.
>>>>
>>>> I suspect there would be a huge difference if we just remove it in
>>>> Spark 3.5.5 and only support the migration case. That has been something I
>>>> claimed if we really want to kick the incorrect config out ASAP. If we want
>>>> to do this gracefully, I don't feel like removing this in Spark 4.0.0 is
>>>> giving users enough time to indicate the config is deprecated and will be
>>>> removed very soon.
>>>>
>>>> On Wed, Feb 19, 2025 at 2:24 AM Holden Karau <holden.ka...@gmail.com>
>>>> wrote:
>>>>
>>>>> I think that removing in 4 sounds reasonable to me as well. It’s
>>>>> important to create a sense of fairness among vendors.
>>>>>
>>>>> Twitter: https://twitter.com/holdenkarau
>>>>> Fight Health Insurance: https://www.fighthealthinsurance.com/
>>>>> <https://www.fighthealthinsurance.com/?q=hk_email>
>>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>> Pronouns: she/her
>>>>>
>>>>>
>>>>> On Tue, Feb 18, 2025 at 11:22 AM Dongjoon Hyun <
>>>>> dongjoon.h...@gmail.com> wrote:
>>>>>
>>>>>> I don't think there is a reason to keep it at 4.0.0 (and forever?) if
>>>>>> we release Spark 3.5.5 with the proper deprecation. This is a big
>>>>>> difference, Wenchen.
>>>>>>
>>>>>> And, the difference is the main reason why I initiated this thread to
>>>>>> sugguest to remove 'spark.databricks.*' completely from Apache Spark 4 
>>>>>> via
>>>>>> volunteering Spark 3.5.5 release manager.
>>>>>>
>>>>>> Sincerely,
>>>>>> Dongjoon
>>>>>>
>>>>>>
>>>>>> On Mon, Feb 17, 2025 at 22:59 Wenchen Fan <cloud0...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> It’s unfortunate that we missed identifying these issues during the
>>>>>>> code review. However, since they have already been released, I believe
>>>>>>> deprecating them is a better approach than removing them, as the latter
>>>>>>> would introduce a breaking change.
>>>>>>>
>>>>>>> Regarding Jungtaek’s PR <https://github.com/apache/spark/pull/49983>,
>>>>>>> it looks like there are only a few lines of migration code. Would it be
>>>>>>> acceptable to leave them for legacy support? With the new config name 
>>>>>>> style
>>>>>>> check rule in place, such issues should not occur again in the future.
>>>>>>>
>>>>>>> On Tue, Feb 18, 2025 at 9:00 AM Jungtaek Lim <
>>>>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I think I can add a color to minimize the concern.
>>>>>>>>
>>>>>>>> The problematic config we added is arguably not user facing. I'd
>>>>>>>> argue moderate users wouldn't even understand what the flag is doing. 
>>>>>>>> The
>>>>>>>> config was added because Structured Streaming has been leveraging SQL
>>>>>>>> config to "do the magic" on having two different default values for new
>>>>>>>> query vs old query (checkpoint is created from the version where the 
>>>>>>>> fix is
>>>>>>>> not landed). This is purely used for backward compatibility, not 
>>>>>>>> something
>>>>>>>> we want to give users flexibility.
>>>>>>>>
>>>>>>>> That said, I don't see a risk of removing config "at any point".
>>>>>>>> (I'd even say removing this config in Spark 3.5.5 does not change 
>>>>>>>> anything.
>>>>>>>> The reason I'm not removing the config in 3.5 (and yet to 4.0/master) 
>>>>>>>> is
>>>>>>>> just to address any concern on being conservative.)
>>>>>>>>
>>>>>>>> I think you are worrying about case 1 from my comment. From my new
>>>>>>>> change (link <https://github.com/apache/spark/pull/49983>), I made
>>>>>>>> a migration logic when the offset log contains the problematic
>>>>>>>> configuration - we will take the value, but put the value to the new
>>>>>>>> config, and at the next microbatch planning, the offset log will 
>>>>>>>> contain
>>>>>>>> the new configuration going forward. This addresses the case 1, as 
>>>>>>>> long as
>>>>>>>> we retain the migration logic for a couple minor releases (say, 4.2 or 
>>>>>>>> so).
>>>>>>>> We just need to support this migration logic for the time where we 
>>>>>>>> never
>>>>>>>> thought of jumping directly from Spark 3.5.4 to the version.
>>>>>>>>
>>>>>>>> Hope this helps to address your concern/worrying.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Feb 18, 2025 at 7:40 AM Bjørn Jørgensen <
>>>>>>>> bjornjorgen...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Having breaking changes in a minor seems not that good.. As I'm
>>>>>>>>> reading this,
>>>>>>>>>
>>>>>>>>> "*This could break the query if the rule impacts the query,
>>>>>>>>> because the effectiveness of the fix is flipped.*"
>>>>>>>>> https://github.com/apache/spark/pull/49897#issuecomment-2652567140
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> What if we have this https://github.com/apache/spark/pull/48149
>>>>>>>>> change in the branch and remove it only for version 4? That way we 
>>>>>>>>> dont
>>>>>>>>> break anything.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> man. 17. feb. 2025 kl. 23:03 skrev Dongjoon Hyun <
>>>>>>>>> dongjoon.h...@gmail.com>:
>>>>>>>>>
>>>>>>>>>> Hi, All.
>>>>>>>>>>
>>>>>>>>>> I'd like to highlight this discussion because this is more
>>>>>>>>>> important and tricky in a way.
>>>>>>>>>>
>>>>>>>>>> As already mentioned in the mailing list and PRs, there was an
>>>>>>>>>> obvious mistake
>>>>>>>>>> which missed an improper configuration name, `spark.databricks.*`.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> https://github.com/apache/spark/blob/a6f220d951742f4074b37772485ee0ec7a774e7d/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L3424
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> `spark.databricks.sql.optimizer.pruneFiltersCanPruneStreamingSubplan`
>>>>>>>>>>
>>>>>>>>>> In fact, Apache Spark committers have been preventing this
>>>>>>>>>> repetitive mistake
>>>>>>>>>> pattern during the review stages successfully until we slip the
>>>>>>>>>> following backportings
>>>>>>>>>> at Apache Spark 3.5.4.
>>>>>>>>>>
>>>>>>>>>> https://github.com/apache/spark/pull/45649
>>>>>>>>>> https://github.com/apache/spark/pull/48149
>>>>>>>>>> https://github.com/apache/spark/pull/49121
>>>>>>>>>>
>>>>>>>>>> At this point of writing, `spark.databricks.*` was removed
>>>>>>>>>> successfully from `master`
>>>>>>>>>> and `branch-4.0` and a new ScalaStyle rule was added to protect
>>>>>>>>>> Apache Spark repository
>>>>>>>>>> from future mistakes.
>>>>>>>>>>
>>>>>>>>>> SPARK-51172 Rename to
>>>>>>>>>> spark.sql.optimizer.pruneFiltersCanPruneStreamingSubplan
>>>>>>>>>> SPARK-51173 Add `configName` Scalastyle rule
>>>>>>>>>>
>>>>>>>>>> What I proposed is to release Apache Spark 3.5.5 next week with
>>>>>>>>>> the deprecation
>>>>>>>>>> in order to make Apache Spark 4.0 be free of `spark.databricks.*`
>>>>>>>>>> configuration.
>>>>>>>>>>
>>>>>>>>>> Apache Spark 3.5.5 (2025 February, with deprecation warning with
>>>>>>>>>> alternative)
>>>>>>>>>> Apache Spark 4.0.0 (2025 March, without `spark.databricks.*`
>>>>>>>>>> config)
>>>>>>>>>>
>>>>>>>>>> In addition, I'd like to volunteer as a release manager of Apache
>>>>>>>>>> Spark 3.5.5
>>>>>>>>>> for a swift release. WDYT?
>>>>>>>>>>
>>>>>>>>>> FYI, `branch-3.5` has 37 patches currently.
>>>>>>>>>>
>>>>>>>>>> $ git log --oneline v3.5.4..HEAD | wc -l
>>>>>>>>>>       37
>>>>>>>>>>
>>>>>>>>>> Best Regards,
>>>>>>>>>> Dongjoon.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Bjørn Jørgensen
>>>>>>>>> Vestre Aspehaug 4, 6010 Ålesund
>>>>>>>>> <https://www.google.com/maps/search/Vestre+Aspehaug+4,+6010+%C3%85lesund++Norge?entry=gmail&source=g>
>>>>>>>>> Norge
>>>>>>>>> <https://www.google.com/maps/search/Vestre+Aspehaug+4,+6010+%C3%85lesund++Norge?entry=gmail&source=g>
>>>>>>>>>
>>>>>>>>> +47 480 94 297
>>>>>>>>>
>>>>>>>>

Reply via email to