Re: Deprecating and banning `spark.databricks.*` config from Apache Spark repository

Jungtaek Lim Tue, 18 Feb 2025 16:40:38 -0800

Though if we are OK with disturbing users to read the migration guide to
figure out the change for the case of direct upgrade to Spark 4.0.0+, I
agree this is also one of the valid ways.


On Wed, Feb 19, 2025 at 9:20 AM Jungtaek Lim <[email protected]>
wrote:

> The point is, why can't we remove it from Spark 3.5.5 as well if we are
> planning to "remove" (not deprecate) at the very next minor release?
>
> The logic of migration just works without having the incorrect config key
> to be indicated with SQL config key.
>
> That said, the point we debate here is only valid when we want to let
> users keep setting the value of the config manually for some time. I'd
> argue users would never set this manually, but let's put this aside for now.
>
> What is the proper "some time" here? If we deprecate this in Spark 3.5.5
> and remove it to Spark 4.0.0, we are going to break the case of setting the
> config manually when they are upgrading Spark 3.5.4 to Spark 4.0.0
> directly. I have no idea how we construct the message to recommend
> upgrading from Spark 3.5.4 to Spark 3.5.5, but if it's not strong enough,
> there will always be a case who directly jumps major/minor versions, and
> for them, effectively we do not deprecate the config very well.
>
> I suspect there would be a huge difference if we just remove it in Spark
> 3.5.5 and only support the migration case. That has been something I
> claimed if we really want to kick the incorrect config out ASAP. If we want
> to do this gracefully, I don't feel like removing this in Spark 4.0.0 is
> giving users enough time to indicate the config is deprecated and will be
> removed very soon.
>
> On Wed, Feb 19, 2025 at 2:24 AM Holden Karau <[email protected]>
> wrote:
>
>> I think that removing in 4 sounds reasonable to me as well. It’s
>> important to create a sense of fairness among vendors.
>>
>> Twitter: https://twitter.com/holdenkarau
>> Fight Health Insurance: https://www.fighthealthinsurance.com/
>> <https://www.fighthealthinsurance.com/?q=hk_email>
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>> Pronouns: she/her
>>
>>
>> On Tue, Feb 18, 2025 at 11:22 AM Dongjoon Hyun <[email protected]>
>> wrote:
>>
>>> I don't think there is a reason to keep it at 4.0.0 (and forever?) if we
>>> release Spark 3.5.5 with the proper deprecation. This is a big difference,
>>> Wenchen.
>>>
>>> And, the difference is the main reason why I initiated this thread to
>>> sugguest to remove 'spark.databricks.*' completely from Apache Spark 4 via
>>> volunteering Spark 3.5.5 release manager.
>>>
>>> Sincerely,
>>> Dongjoon
>>>
>>>
>>> On Mon, Feb 17, 2025 at 22:59 Wenchen Fan <[email protected]> wrote:
>>>
>>>> It’s unfortunate that we missed identifying these issues during the
>>>> code review. However, since they have already been released, I believe
>>>> deprecating them is a better approach than removing them, as the latter
>>>> would introduce a breaking change.
>>>>
>>>> Regarding Jungtaek’s PR <https://github.com/apache/spark/pull/49983>,
>>>> it looks like there are only a few lines of migration code. Would it be
>>>> acceptable to leave them for legacy support? With the new config name style
>>>> check rule in place, such issues should not occur again in the future.
>>>>
>>>> On Tue, Feb 18, 2025 at 9:00 AM Jungtaek Lim <
>>>> [email protected]> wrote:
>>>>
>>>>> I think I can add a color to minimize the concern.
>>>>>
>>>>> The problematic config we added is arguably not user facing. I'd argue
>>>>> moderate users wouldn't even understand what the flag is doing. The config
>>>>> was added because Structured Streaming has been leveraging SQL config to
>>>>> "do the magic" on having two different default values for new query vs old
>>>>> query (checkpoint is created from the version where the fix is not 
>>>>> landed).
>>>>> This is purely used for backward compatibility, not something we want to
>>>>> give users flexibility.
>>>>>
>>>>> That said, I don't see a risk of removing config "at any point". (I'd
>>>>> even say removing this config in Spark 3.5.5 does not change anything. The
>>>>> reason I'm not removing the config in 3.5 (and yet to 4.0/master) is just
>>>>> to address any concern on being conservative.)
>>>>>
>>>>> I think you are worrying about case 1 from my comment. From my new
>>>>> change (link <https://github.com/apache/spark/pull/49983>), I made a
>>>>> migration logic when the offset log contains the problematic configuration
>>>>> - we will take the value, but put the value to the new config, and at the
>>>>> next microbatch planning, the offset log will contain the new 
>>>>> configuration
>>>>> going forward. This addresses the case 1, as long as we retain the
>>>>> migration logic for a couple minor releases (say, 4.2 or so). We just need
>>>>> to support this migration logic for the time where we never thought of
>>>>> jumping directly from Spark 3.5.4 to the version.
>>>>>
>>>>> Hope this helps to address your concern/worrying.
>>>>>
>>>>>
>>>>> On Tue, Feb 18, 2025 at 7:40 AM Bjørn Jørgensen <
>>>>> [email protected]> wrote:
>>>>>
>>>>>>
>>>>>> Having breaking changes in a minor seems not that good.. As I'm
>>>>>> reading this,
>>>>>>
>>>>>> "*This could break the query if the rule impacts the query, because
>>>>>> the effectiveness of the fix is flipped.*"
>>>>>> https://github.com/apache/spark/pull/49897#issuecomment-2652567140
>>>>>>
>>>>>>
>>>>>> What if we have this https://github.com/apache/spark/pull/48149
>>>>>> change in the branch and remove it only for version 4? That way we dont
>>>>>> break anything.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> man. 17. feb. 2025 kl. 23:03 skrev Dongjoon Hyun <
>>>>>> [email protected]>:
>>>>>>
>>>>>>> Hi, All.
>>>>>>>
>>>>>>> I'd like to highlight this discussion because this is more important
>>>>>>> and tricky in a way.
>>>>>>>
>>>>>>> As already mentioned in the mailing list and PRs, there was an
>>>>>>> obvious mistake
>>>>>>> which missed an improper configuration name, `spark.databricks.*`.
>>>>>>>
>>>>>>>
>>>>>>> https://github.com/apache/spark/blob/a6f220d951742f4074b37772485ee0ec7a774e7d/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L3424
>>>>>>>
>>>>>>> `spark.databricks.sql.optimizer.pruneFiltersCanPruneStreamingSubplan`
>>>>>>>
>>>>>>> In fact, Apache Spark committers have been preventing this
>>>>>>> repetitive mistake
>>>>>>> pattern during the review stages successfully until we slip the
>>>>>>> following backportings
>>>>>>> at Apache Spark 3.5.4.
>>>>>>>
>>>>>>> https://github.com/apache/spark/pull/45649
>>>>>>> https://github.com/apache/spark/pull/48149
>>>>>>> https://github.com/apache/spark/pull/49121
>>>>>>>
>>>>>>> At this point of writing, `spark.databricks.*` was removed
>>>>>>> successfully from `master`
>>>>>>> and `branch-4.0` and a new ScalaStyle rule was added to protect
>>>>>>> Apache Spark repository
>>>>>>> from future mistakes.
>>>>>>>
>>>>>>> SPARK-51172 Rename to
>>>>>>> spark.sql.optimizer.pruneFiltersCanPruneStreamingSubplan
>>>>>>> SPARK-51173 Add `configName` Scalastyle rule
>>>>>>>
>>>>>>> What I proposed is to release Apache Spark 3.5.5 next week with the
>>>>>>> deprecation
>>>>>>> in order to make Apache Spark 4.0 be free of `spark.databricks.*`
>>>>>>> configuration.
>>>>>>>
>>>>>>> Apache Spark 3.5.5 (2025 February, with deprecation warning with
>>>>>>> alternative)
>>>>>>> Apache Spark 4.0.0 (2025 March, without `spark.databricks.*` config)
>>>>>>>
>>>>>>> In addition, I'd like to volunteer as a release manager of Apache
>>>>>>> Spark 3.5.5
>>>>>>> for a swift release. WDYT?
>>>>>>>
>>>>>>> FYI, `branch-3.5` has 37 patches currently.
>>>>>>>
>>>>>>> $ git log --oneline v3.5.4..HEAD | wc -l
>>>>>>>       37
>>>>>>>
>>>>>>> Best Regards,
>>>>>>> Dongjoon.
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Bjørn Jørgensen
>>>>>> Vestre Aspehaug 4, 6010 Ålesund
>>>>>> <https://www.google.com/maps/search/Vestre+Aspehaug+4,+6010+%C3%85lesund++Norge?entry=gmail&source=g>
>>>>>> Norge
>>>>>> <https://www.google.com/maps/search/Vestre+Aspehaug+4,+6010+%C3%85lesund++Norge?entry=gmail&source=g>
>>>>>>
>>>>>> +47 480 94 297
>>>>>>
>>>>>

Re: Deprecating and banning `spark.databricks.*` config from Apache Spark repository

Reply via email to