Please read through the explanation of how this impacts OSS users in the other branch of this discussion. This happened in "Apache" Spark 3.5.4, and the migration logic has nothing to do with the vendor. It is primarily there to avoid breaking users on "Apache" Spark 3.5.4 who want to upgrade directly to "Apache" Spark 4.0.0+.
I'm not implying anything about compatibility between OSS and vendors. I would point out that the problematic config name is not a problem for the vendor: all the vendor needs to do is alias the incorrect config name to the new config name, and they are done. The vendor does not need any such migration code. This is simply due diligence in mitigating the mistake we have made. Again, if the migration logic does not land in Apache Spark 4.0.x, only "Apache" Spark users will be impacted.

On Tue, Mar 11, 2025 at 10:25 AM Andrew Melo <andrew.m...@gmail.com> wrote:

> Hello Jungtaek,
>
> I'm not implying that this improves the vendors' life. I'm just not
> understanding the issue -- the downstream people started a stream with a
> config option that the upstream people don't want to carry. If the affected
> users are using the downstream fork (which is how they got the option),
> then why can't those same downstream users keep using their
> vendor-provided downstream fork?
>
> To put it in less complicated words: presumably the people using the
> databricks.* configs were already using the Databricks runtime. Why does
> Apache Spark need to carry an extra migration patch when the users who
> would be affected are already using the Databricks fork? I don't see a
> situation where:
>
> A) legacy 3.5.x queries were using databricks-specific options
> B) these users want to run the same queries in OSS Spark today
> C) the same people will not be using the Databricks fork.
>
> This Venn diagram seems very small, and I don't think it justifies
> carrying migration code for that one sliver of users.
>
> Andrew
>
> On Mon, Mar 10, 2025 at 5:29 PM Jungtaek Lim <kabhwan.opensou...@gmail.com>
> wrote:
>
> > One thing I can correct immediately is that downstream is not impacted
> > by this at all. I believe I clarified that the config will not be
> > modified by anyone, so downstream there is nothing to change. The
> > problem is specific to OSS; downstream does not have any issue with
> > this leak at all.
> >
> > (One thing to clarify: the config itself will be removed in Spark
> > 4.0.0. We only propose to retain the migration logic.)
> >
> > I believe there is a huge misunderstanding - we are not proposing this
> > migration logic to ease the vendor's life, no, it's not. If I didn't
> > care about OSS, there would be no incentive for me to propose this.
> >
> > I just wanted to do my best to remove any burden from users who are
> > innocent in this problem. Without the migration logic, users will be
> > forced to upgrade to Spark 3.5.5+ before upgrading to Spark 4.0.0+.
> > Isn't that bad enough? Why should we let users hit this bug when we
> > can avoid it? The problem was introduced in the OSS community (I hope
> > we don't blame ourselves for mistakes. We are human.), so it is up to
> > us to resolve this properly. We don't have the right to break users'
> > queries.
> >
> > On Tue, Mar 11, 2025 at 7:13 AM Andrew Melo <andrew.m...@gmail.com>
> > wrote:
> >
> >> Hello all,
> >>
> >> As an outsider, I don't fully understand this discussion. This
> >> particular configuration option "leaked" into the open-source Spark
> >> distribution, and now there is a lot of discussion about how to
> >> mitigate existing workloads. But: presumably the people who are
> >> depending on this configuration flag are already using a downstream
> >> (vendor-specific) fork, and a future update will similarly be
> >> distributed by that downstream provider.
> >>
> >> Which people a) made a workflow using the vendor fork and b) want to
> >> resume it in the OSS version of Spark?
> >>
> >> It seems like the people who are affected by this will already be
> >> using someone else's fork, and there's no need to carry this patch in
> >> the mainline Spark code.
> >>
> >> For that reason, I believe the code should be dropped by OSS Spark,
> >> and vendors who need to mitigate it can push the appropriate changes
> >> to their downstreams.
> >>
> >> Thanks,
> >> Andrew
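[Editor's note: for readers following along, the "alias the incorrect config name to the new config name" mitigation discussed above boils down to rewriting a leaked legacy key to its canonical replacement when old settings are read back. A minimal, language-agnostic sketch of that idea is below; the key names and helper function are hypothetical illustrations, not the actual Spark config names or Spark's real migration code.]

```python
# Hypothetical key names for illustration only; the real leaked
# config name from the thread is intentionally not reproduced here.
LEGACY_TO_CANONICAL = {
    "spark.databricks.example.leakedOption": "spark.sql.example.newOption",
}


def migrate_conf(conf: dict) -> dict:
    """Rewrite leaked legacy keys to their canonical names.

    If both the legacy and the canonical key are present, the
    canonical key wins, mirroring the usual "alias old name to new
    name" approach a vendor (or a migration shim) would take.
    """
    migrated = {}
    for key, value in conf.items():
        new_key = LEGACY_TO_CANONICAL.get(key, key)
        # Don't let a legacy key clobber an explicitly set canonical key.
        if new_key == key or new_key not in conf:
            migrated[new_key] = value
    return migrated
```

With such a shim in place, a query started under the old (leaked) name keeps working after an upgrade without the user touching their configuration, which is the user-facing benefit being argued for in the thread.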