Re: Deprecating and banning `spark.databricks.*` config from Apache Spark repository

Holden Karau Tue, 18 Feb 2025 09:25:26 -0800

Removing it in 4 sounds reasonable to me as well. It’s important
to create a sense of fairness among vendors.


Twitter: https://twitter.com/holdenkarau
Fight Health Insurance: https://www.fighthealthinsurance.com/
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
Pronouns: she/her


On Tue, Feb 18, 2025 at 11:22 AM Dongjoon Hyun <[email protected]>
wrote:

> I don't think there is a reason to keep it in 4.0.0 (and forever?) if we
> release Spark 3.5.5 with the proper deprecation. This is a big difference,
> Wenchen.
>
> And that difference is the main reason I started this thread: to
> suggest removing 'spark.databricks.*' completely from Apache Spark 4 by
> volunteering as the Spark 3.5.5 release manager.
>
> Sincerely,
> Dongjoon
>
>
> On Mon, Feb 17, 2025 at 22:59 Wenchen Fan <[email protected]> wrote:
>
>> It’s unfortunate that we missed identifying these issues during the code
>> review. However, since they have already been released, I believe
>> deprecating them is a better approach than removing them, as the latter
>> would introduce a breaking change.
>>
>> Regarding Jungtaek’s PR <https://github.com/apache/spark/pull/49983>, it
>> looks like there are only a few lines of migration code. Would it be
>> acceptable to leave them for legacy support? With the new config name style
>> check rule in place, such issues should not occur again in the future.
>>
>> On Tue, Feb 18, 2025 at 9:00 AM Jungtaek Lim <
>> [email protected]> wrote:
>>
>>> I think I can add some color to minimize the concern.
>>>
>>> The problematic config we added is arguably not user facing. I'd argue
>>> typical users wouldn't even understand what the flag is doing. The config
>>> was added because Structured Streaming has been leveraging SQL config to
>>> "do the magic" of having two different default values for a new query vs an
>>> old query (one whose checkpoint was created by a version where the fix had
>>> not landed). This is purely used for backward compatibility, not something
>>> we want to give users flexibility over.
>>>
>>> That said, I don't see a risk in removing the config "at any point". (I'd
>>> even say removing this config in Spark 3.5.5 does not change anything. The
>>> reason I'm not removing the config in 3.5 (and have yet to in 4.0/master)
>>> is just to be conservative.)
>>>
>>> I think you are worried about case 1 from my comment. In my new
>>> change (link <https://github.com/apache/spark/pull/49983>), I added
>>> migration logic for when the offset log contains the problematic
>>> configuration: we take the value but store it under the new config, and
>>> from the next microbatch planning onward the offset log will contain the
>>> new configuration. This addresses case 1, as long as we retain the
>>> migration logic for a couple of minor releases (say, until 4.2 or so). We
>>> just need to support this migration logic until we no longer expect anyone
>>> to jump directly from Spark 3.5.4 to that version.
>>>
>>> Hope this helps to address your concern.
>>>
>>>
>>> On Tue, Feb 18, 2025 at 7:40 AM Bjørn Jørgensen <
>>> [email protected]> wrote:
>>>
>>>>
>>>> Having breaking changes in a minor release doesn't seem good. As I
>>>> read this,
>>>>
>>>> "*This could break the query if the rule impacts the query, because
>>>> the effectiveness of the fix is flipped.*"
>>>> https://github.com/apache/spark/pull/49897#issuecomment-2652567140
>>>>
>>>>
>>>> What if we keep this https://github.com/apache/spark/pull/48149 change
>>>> in the branch and remove it only for version 4? That way we don't break
>>>> anything.
>>>>
>>>>
>>>>
>>>>
>>>> On Mon, Feb 17, 2025 at 23:03, Dongjoon Hyun <
>>>> [email protected]> wrote:
>>>>
>>>>> Hi, All.
>>>>>
>>>>> I'd like to highlight this discussion because it is both important
>>>>> and, in a way, tricky.
>>>>>
>>>>> As already mentioned on the mailing list and in the PRs, an obvious
>>>>> mistake let an improper configuration name, `spark.databricks.*`,
>>>>> slip through.
>>>>>
>>>>>
>>>>> https://github.com/apache/spark/blob/a6f220d951742f4074b37772485ee0ec7a774e7d/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L3424
>>>>>
>>>>> `spark.databricks.sql.optimizer.pruneFiltersCanPruneStreamingSubplan`
>>>>>
>>>>> In fact, Apache Spark committers had been successfully preventing this
>>>>> recurring mistake during review until the following backports
>>>>> slipped into Apache Spark 3.5.4.
>>>>>
>>>>> https://github.com/apache/spark/pull/45649
>>>>> https://github.com/apache/spark/pull/48149
>>>>> https://github.com/apache/spark/pull/49121
>>>>>
>>>>> At the time of writing, `spark.databricks.*` has been removed
>>>>> from `master` and `branch-4.0`, and a new Scalastyle rule has been
>>>>> added to protect the Apache Spark repository from future mistakes.
>>>>>
>>>>> SPARK-51172 Rename to
>>>>> spark.sql.optimizer.pruneFiltersCanPruneStreamingSubplan
>>>>> SPARK-51173 Add `configName` Scalastyle rule
>>>>>
>>>>> What I propose is to release Apache Spark 3.5.5 next week with the
>>>>> deprecation, so that Apache Spark 4.0 is free of `spark.databricks.*`
>>>>> configuration.
>>>>>
>>>>> Apache Spark 3.5.5 (2025 February, with deprecation warning with
>>>>> alternative)
>>>>> Apache Spark 4.0.0 (2025 March, without `spark.databricks.*` config)
>>>>>
>>>>> In addition, I'd like to volunteer as a release manager of Apache
>>>>> Spark 3.5.5
>>>>> for a swift release. WDYT?
>>>>>
>>>>> FYI, `branch-3.5` has 37 patches currently.
>>>>>
>>>>> $ git log --oneline v3.5.4..HEAD | wc -l
>>>>>       37
>>>>>
>>>>> Best Regards,
>>>>> Dongjoon.
>>>>>
>>>>
>>>>
>>>> --
>>>> Bjørn Jørgensen
>>>> Vestre Aspehaug 4, 6010 Ålesund
>>>> Norge
>>>>
>>>> +47 480 94 297
>>>>
>>>
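
To make the migration step Jungtaek describes above more concrete, here is a minimal sketch of the idea: when the configuration read back from the offset log still carries the old `spark.databricks.*` key, take its value, store it under the new key, and carry only the new key forward. This is an illustration in Python, not the code from the PR; the function name `migrate_offset_log_conf` and the plain-dict representation of the offset log's config are assumptions, and so is the choice to keep an already-present new-key value.

```python
from typing import Dict

# Both key names are quoted in this thread; everything else is hypothetical.
OLD_KEY = "spark.databricks.sql.optimizer.pruneFiltersCanPruneStreamingSubplan"
NEW_KEY = "spark.sql.optimizer.pruneFiltersCanPruneStreamingSubplan"


def migrate_offset_log_conf(conf: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of `conf` with the legacy key moved to the new key.

    `conf` stands in for the config map recovered from the offset log
    (assumed here to be a plain dict of strings).
    """
    migrated = dict(conf)  # do not mutate the caller's map
    if OLD_KEY in migrated:
        legacy_value = migrated.pop(OLD_KEY)
        # Assumption: if the new key is already set, it wins over the legacy one.
        migrated.setdefault(NEW_KEY, legacy_value)
    return migrated
```

Running the function on an already-migrated map returns it unchanged, which matches the description that the offset log contains only the new configuration from the next microbatch onward.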

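The config name style check mentioned in the thread (SPARK-51173) can be pictured as a simple pattern scan over source text. The sketch below is only an approximation in Python: the actual Scalastyle rule is not quoted in this thread, so the regular expression and the function name `find_banned_config_names` are assumptions.

```python
import re
from typing import List, Tuple

# Assumed pattern: any config name starting with "spark.databricks.".
BANNED_CONFIG = re.compile(r"spark\.databricks\.[A-Za-z0-9_.]*")


def find_banned_config_names(source: str) -> List[Tuple[int, str]]:
    """Return (line_number, config_name) for each banned name found in `source`."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in BANNED_CONFIG.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits
```

A check like this would flag `spark.databricks.sql.optimizer.pruneFiltersCanPruneStreamingSubplan` while leaving the renamed `spark.sql.optimizer.pruneFiltersCanPruneStreamingSubplan` alone.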