The point is: why can't we remove it in Spark 3.5.5 as well, if we are planning to "remove" (not merely deprecate) it at the very next release?
The migration logic just works without the incorrect config key having to be exposed as a SQL config key. That said, the point we are debating is only valid if we want to let users keep setting the value of the config manually for some time. I'd argue users would never set this manually, but let's put that aside for now. What is the proper "some time" here? If we deprecate this in Spark 3.5.5 and remove it in Spark 4.0.0, we break the case of setting the config manually for anyone upgrading directly from Spark 3.5.4 to Spark 4.0.0. I have no idea how we would word the message recommending an upgrade from Spark 3.5.4 to Spark 3.5.5, but unless it is very strongly worded, there will always be users who jump major/minor versions directly, and for them we effectively do not deprecate the config at all.

I doubt there would be a huge difference if we just removed it in Spark 3.5.5 and only supported the migration case. That has been my claim, if we really want to kick the incorrect config out ASAP. If we want to do this gracefully, I don't feel that removing it in Spark 4.0.0 gives users enough time to learn that the config is deprecated and will be removed very soon.

On Wed, Feb 19, 2025 at 2:24 AM Holden Karau <holden.ka...@gmail.com> wrote:

> I think that removing in 4 sounds reasonable to me as well. It's important to create a sense of fairness among vendors.
>
> Twitter: https://twitter.com/holdenkarau
> Fight Health Insurance: https://www.fighthealthinsurance.com/
> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
> Pronouns: she/her
>
> On Tue, Feb 18, 2025 at 11:22 AM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>
>> I don't think there is a reason to keep it at 4.0.0 (and forever?) if we release Spark 3.5.5 with the proper deprecation.
>> This is a big difference, Wenchen.
>>
>> And the difference is the main reason why I initiated this thread: to suggest removing 'spark.databricks.*' completely from Apache Spark 4, by volunteering as the Spark 3.5.5 release manager.
>>
>> Sincerely,
>> Dongjoon
>>
>> On Mon, Feb 17, 2025 at 22:59 Wenchen Fan <cloud0...@gmail.com> wrote:
>>
>>> It's unfortunate that we missed identifying these issues during the code review. However, since they have already been released, I believe deprecating them is a better approach than removing them, as the latter would introduce a breaking change.
>>>
>>> Regarding Jungtaek's PR <https://github.com/apache/spark/pull/49983>, it looks like there are only a few lines of migration code. Would it be acceptable to leave them in for legacy support? With the new config name style check rule in place, such issues should not occur again in the future.
>>>
>>> On Tue, Feb 18, 2025 at 9:00 AM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>>>
>>>> I think I can add some color to minimize the concern.
>>>>
>>>> The problematic config we added is arguably not user-facing. I'd argue most users wouldn't even understand what the flag does. The config was added because Structured Streaming has been leveraging SQL configs to "do the magic" of having two different default values for new queries vs. old queries (where the checkpoint was created by a version in which the fix had not landed). It is purely used for backward compatibility, not something we want to give users flexibility over.
>>>>
>>>> That said, I don't see a risk in removing the config "at any point". (I'd even say removing this config in Spark 3.5.5 does not change anything. The reason I'm not removing the config in 3.5, and not yet in 4.0/master, is just to address any concern about being conservative.)
>>>>
>>>> I think you are worrying about case 1 from my comment.
>>>> In my new change (link <https://github.com/apache/spark/pull/49983>), I added migration logic for when the offset log contains the problematic configuration: we take the value but write it under the new config, so from the next microbatch planning onward the offset log contains the new configuration. This addresses case 1, as long as we retain the migration logic for a couple of minor releases (say, until 4.2 or so). We just need to keep the migration logic long enough that no one would plausibly jump directly from Spark 3.5.4 to a version that no longer has it.
>>>>
>>>> Hope this helps to address your concern/worry.
>>>>
>>>> On Tue, Feb 18, 2025 at 7:40 AM Bjørn Jørgensen <bjornjorgen...@gmail.com> wrote:
>>>>
>>>>> Having breaking changes in a minor release does not seem good. As I'm reading this:
>>>>>
>>>>> "*This could break the query if the rule impacts the query, because the effectiveness of the fix is flipped.*"
>>>>> https://github.com/apache/spark/pull/49897#issuecomment-2652567140
>>>>>
>>>>> What if we keep the https://github.com/apache/spark/pull/48149 change in the branch and remove it only for version 4? That way we don't break anything.
>>>>>
>>>>> On Mon, Feb 17, 2025 at 23:03, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>>>
>>>>>> Hi, All.
>>>>>>
>>>>>> I'd like to highlight this discussion because it is more important and tricky in a way.
>>>>>>
>>>>>> As already mentioned on the mailing list and in PRs, there was an obvious mistake that let an improper configuration name, `spark.databricks.*`, slip through.
>>>>>> https://github.com/apache/spark/blob/a6f220d951742f4074b37772485ee0ec7a774e7d/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L3424
>>>>>>
>>>>>> `spark.databricks.sql.optimizer.pruneFiltersCanPruneStreamingSubplan`
>>>>>>
>>>>>> In fact, Apache Spark committers had been successfully preventing this repetitive mistake pattern during review, until we let the following backports slip into Apache Spark 3.5.4:
>>>>>>
>>>>>> https://github.com/apache/spark/pull/45649
>>>>>> https://github.com/apache/spark/pull/48149
>>>>>> https://github.com/apache/spark/pull/49121
>>>>>>
>>>>>> As of this writing, `spark.databricks.*` has been removed successfully from `master` and `branch-4.0`, and a new Scalastyle rule was added to protect the Apache Spark repository from future mistakes:
>>>>>>
>>>>>> SPARK-51172 Rename to spark.sql.optimizer.pruneFiltersCanPruneStreamingSubplan
>>>>>> SPARK-51173 Add `configName` Scalastyle rule
>>>>>>
>>>>>> What I propose is to release Apache Spark 3.5.5 next week with the deprecation, in order to make Apache Spark 4.0 free of the `spark.databricks.*` configuration:
>>>>>>
>>>>>> Apache Spark 3.5.5 (February 2025, with a deprecation warning and an alternative)
>>>>>> Apache Spark 4.0.0 (March 2025, without the `spark.databricks.*` config)
>>>>>>
>>>>>> In addition, I'd like to volunteer as the release manager of Apache Spark 3.5.5 for a swift release. WDYT?
>>>>>>
>>>>>> FYI, `branch-3.5` currently has 37 patches:
>>>>>>
>>>>>> $ git log --oneline v3.5.4..HEAD | wc -l
>>>>>> 37
>>>>>>
>>>>>> Best Regards,
>>>>>> Dongjoon.
>>>>> --
>>>>> Bjørn Jørgensen
>>>>> Vestre Aspehaug 4, 6010 Ålesund, Norge
>>>>> +47 480 94 297
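The migration logic Jungtaek describes (carrying the value recorded under the mis-named key in the offset log over to the renamed key, so subsequent microbatches persist only the new key) can be sketched roughly as follows. This is a minimal illustrative sketch in Python; the real implementation is Scala inside Spark's Structured Streaming internals (see the linked PR), and the function name and structure here are hypothetical, not Spark's actual API.

```python
# Hypothetical sketch of the offset-log config migration discussed above.
# The actual implementation lives in Spark's Scala internals
# (https://github.com/apache/spark/pull/49983); names here are illustrative.

LEGACY_KEY = "spark.databricks.sql.optimizer.pruneFiltersCanPruneStreamingSubplan"
NEW_KEY = "spark.sql.optimizer.pruneFiltersCanPruneStreamingSubplan"


def migrate_offset_log_confs(confs: dict) -> dict:
    """If the offset log recorded the mis-named key, carry its value over to
    the renamed key so the next microbatch persists only the new key."""
    if LEGACY_KEY not in confs:
        return dict(confs)
    migrated = {k: v for k, v in confs.items() if k != LEGACY_KEY}
    # Keep an explicitly-recorded new key if present; otherwise adopt the
    # value from the legacy key.
    migrated.setdefault(NEW_KEY, confs[LEGACY_KEY])
    return migrated


# Example: a checkpoint written by Spark 3.5.4 carries the old key; after
# migration, only the renamed key remains, with the same value.
old_log = {LEGACY_KEY: "false", "spark.sql.shuffle.partitions": "200"}
print(migrate_offset_log_confs(old_log))
```

Under this sketch, retaining the small `migrate_offset_log_confs` shim for a few releases is what lets a query checkpointed on 3.5.4 restart cleanly on a later version even after the old key itself is gone.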