Hello Jungtaek,

I'm not implying that this improves the vendor's life. I'm just not
understanding the issue -- the downstream people started a stream with a
config option that the upstream people don't want to carry. If the affected
users are using the downstream fork (which is how they got the option),
then why can't those same downstream users keep using their
vendor-provided downstream fork?

To put it more simply: presumably the people using the
databricks.* configs were already using the Databricks runtime. Why does
Apache Spark need to carry an extra migration patch when the users who
would be affected are already on the Databricks fork? I don't see a
situation where:

A) legacy 3.5.x queries were using Databricks-specific options,
B) those users want to run the same queries in OSS Spark today, and
C) those same people will not be using the Databricks fork.

This Venn diagram seems very small, and I don't think it justifies carrying
migration code for that one sliver of users.

Andrew

On Mon, Mar 10, 2025 at 5:29 PM Jungtaek Lim <kabhwan.opensou...@gmail.com>
wrote:
>
> One thing I can correct immediately: downstream is not impacted by this
at all. I believe I clarified that the config will not be modified by
anyone, so there is nothing to change downstream. The problem is specific
to OSS; downstream has no issue with this leak at all.
> (One thing to clarify: the config itself will be removed in Spark 4.0.0.
We only propose to retain the migration logic.)
>
> I believe there is a huge misunderstanding -- we are not proposing this
migration logic to ease the vendor's life. If I didn't care about OSS,
there would be no incentive for me to propose this.
>
> I just wanted to do my best to remove any burden from users who are not
at fault for this problem. Without the migration logic, users will be
forced to upgrade to Spark 3.5.5+ before upgrading to Spark 4.0.0+. Isn't
that bad enough? Why should we let users be burdened when we can avoid it?
The problem was introduced in the OSS community (I hope we don't blame
ourselves for mistakes -- we are human), so it is up to us to resolve it
properly. We don't have the right to break users' queries.
>
> On Tue, Mar 11, 2025 at 7:13 AM Andrew Melo <andrew.m...@gmail.com> wrote:
>>
>> Hello all
>>
>> As an outsider, I don't fully understand this discussion. This
>> particular configuration option "leaked" into the open-source Spark
>> distribution, and now there is a lot of discussion about how to
>> mitigate existing workloads. But: presumably the people who are
>> depending on this configuration flag are already using a downstream
>> (vendor-specific) fork, and a future update will similarly be
>> distributed by that downstream provider.
>>
>> Which people a) made a workflow using the vendor fork and b) want to
>> resume it in the OSS version of Spark?
>>
>> It seems like the people who are affected by this will already be
>> using someone else's fork, and there's no need to carry this patch in
>> the mainline Spark code.
>>
>> For that reason, I believe the code should be dropped from OSS Spark,
>> and vendors who need to mitigate it can push the appropriate changes
>> to their downstreams.
>>
>> Thanks
>> Andrew
