Replied inline.

On Tue, Mar 11, 2025 at 10:39 AM Andrew Melo <andrew.m...@gmail.com> wrote:
> Hi Jungtaek,
>
> I've read the discussion, which is why I replied with my questions
> (which you neglected to answer). Your deflection and lack of response
> to direct questions should be (IMO) disqualifying. So, again:
>
> To put it into less complicated words - presumably the people using
> the databricks.* configs were already using the Databricks runtime.
> Why does Apache Spark need to carry an extra migration patch when the
> users who would be affected are already using the Databricks fork? I
> don't see a situation where:
>
> A) legacy 3.5.x queries were using Databricks-specific options
> B) these users want to run the same queries in OSS Spark today
> C) the same people will not be using the Databricks fork.

I don't understand why you couple this with Databricks Runtime. If the
config had only been released in Databricks Runtime and never in Apache
Spark, you would be right - although there are several other ways to solve
the issue on the vendor side, and I wouldn't choose this one, which would
carry a lot of burden.

This config was released in "Apache" Spark 3.5.4, so it is NO LONGER just a
problem with a vendor distribution. The breakage will happen even for users
who have never heard of Databricks Runtime and only ever use Apache Spark.
It happens whenever a user on Apache Spark 3.5.4 upgrades directly to
Apache Spark 4.0.0+ (rather than first upgrading to Apache Spark 3.5.5+).
(Again, in the vendor codebase it is not a problem at all to have such a
config, and there is a much easier way to deal with it - migration logic is
not needed there. A sketch of what this rename-style migration amounts to
appears at the end of this message.)

I think the major misunderstanding in this discussion is the idea that this
pushes a vendor's code into OSS. Again, the migration logic is needed only
by OSS. I can guarantee that I am not doing this for a vendor. (I made my
first ASF contribution in 2014 and was a long-time PMC member of the Apache
Storm project. I'm betting more than 10 years of OSS contribution on this.)
Strictly speaking, I am spending my own time pushing what I believe is the
right way; from a vendor's perspective, not pushing this at all would be
the better use of my time. This is simply due diligence for the mistake,
because I feel I need to do it.

> Without a direct response to this, I think this discussion should be
> considered to be just what it is on its face -- a solution to a
> vendor's mistake, and it should not be ported to OSS Spark.
>
> Thanks
> Andrew
>
> On Mon, Mar 10, 2025 at 8:34 PM Jungtaek Lim
> <kabhwan.opensou...@gmail.com> wrote:
> >
> > Please read through the explanation of how this impacts the OSS users
> > in the other branch of this discussion. This happened in "Apache" Spark
> > 3.5.4, and the migration logic has nothing to do with the vendor. It is
> > primarily there to not break users on "Apache" Spark 3.5.4 who want to
> > upgrade directly to "Apache" Spark 4.0.0+.
> >
> > I'm not implying anything about compatibility between OSS and vendors.
> > I would remind you that the problematic config name is not problematic
> > for the vendor, so all the vendor needs to do is alias the incorrect
> > config name to the new config name, and they are done. The vendor does
> > not need such migration code. This is just due diligence to mitigate
> > the mistake we have made. Again, if the migration logic does not land
> > in Apache Spark 4.0.x, only "Apache" Spark users will be impacted.
> >
> > On Tue, Mar 11, 2025 at 10:25 AM Andrew Melo <andrew.m...@gmail.com>
> > wrote:
> >>
> >> Hello Jungtaek,
> >>
> >> I'm not implying that this improves the vendor's life. I'm just not
> >> understanding the issue -- the downstream people started a stream with
> >> a config option that the upstream people don't want to carry. If the
> >> affected users are using the downstream fork (which is how they got
> >> the option), then why can't those same downstream users keep using
> >> their vendor-provided downstream fork?
> >>
> >> To put it into less complicated words - presumably the people using
> >> the databricks.* configs were already using the Databricks runtime.
> >> Why does Apache Spark need to carry an extra migration patch when the
> >> users who would be affected are already using the Databricks fork? I
> >> don't see a situation where:
> >>
> >> A) legacy 3.5.x queries were using Databricks-specific options
> >> B) these users want to run the same queries in OSS Spark today
> >> C) the same people will not be using the Databricks fork.
> >>
> >> This Venn diagram seems very small, and I don't think it justifies
> >> carrying migration code for that one sliver of users.
> >>
> >> Andrew
> >>
> >> On Mon, Mar 10, 2025 at 5:29 PM Jungtaek Lim
> >> <kabhwan.opensou...@gmail.com> wrote:
> >> >
> >> > One thing I can correct immediately: downstream is not impacted by
> >> > this at all. I believe I clarified that the config will not be
> >> > modified by anyone, so there is nothing to change downstream. The
> >> > problem is specific to OSS; downstream has no issue with this leak
> >> > at all.
> >> > (One thing to clarify: the config itself will be removed in Spark
> >> > 4.0.0. We only propose to retain the migration logic.)
> >> >
> >> > I believe there is a huge misunderstanding - we are not proposing
> >> > this migration logic to ease the vendor's life. If I didn't care
> >> > about OSS, there would be no incentive for me to propose this.
> >> >
> >> > I just wanted to do my best to remove any burden from users who are
> >> > innocent in this problem. If there is no migration logic, users will
> >> > be forced to upgrade to Spark 3.5.5+ before upgrading to Spark
> >> > 4.0.0+. Isn't that bad enough? Why should we let users hit a bug
> >> > when we can avoid it? The problem was introduced in the OSS
> >> > community (I hope we don't blame ourselves for mistakes; we are
> >> > human), so it is up to us to resolve it properly. We don't have the
> >> > right to break users' queries.
> >> >
> >> > On Tue, Mar 11, 2025 at 7:13 AM Andrew Melo <andrew.m...@gmail.com>
> >> > wrote:
> >> >>
> >> >> Hello all
> >> >>
> >> >> As an outsider, I don't fully understand this discussion. This
> >> >> particular configuration option "leaked" into the open-source Spark
> >> >> distribution, and now there is a lot of discussion about how to
> >> >> mitigate existing workloads. But: presumably the people who are
> >> >> depending on this configuration flag are already using a downstream
> >> >> (vendor-specific) fork, and a future update will similarly be
> >> >> distributed by that downstream provider.
> >> >>
> >> >> Which people a) made a workflow using the vendor fork and b) want to
> >> >> resume it in the OSS version of Spark?
> >> >>
> >> >> It seems like the people who are affected by this will already be
> >> >> using someone else's fork, and there's no need to carry this patch
> >> >> in the mainline Spark code.
> >> >>
> >> >> For that reason, I believe the code should be dropped by OSS Spark,
> >> >> and vendors who need to mitigate it can push the appropriate changes
> >> >> to their downstreams.
> >> >>
> >> >> Thanks
> >> >> Andrew
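
To make the disputed scope concrete: the migration logic discussed above is
essentially a one-time key rename, applied when configs that Spark 3.5.4
recorded under the leaked name are restored. Below is a minimal Scala
sketch of the idea only; the key names are hypothetical and the actual code
path inside Spark differs.

object ConfigMigration {
  // Hypothetical key names, for illustration only.
  private val renamedKeys: Map[String, String] = Map(
    "spark.databricks.sql.example.leakedOption" ->
      "spark.sql.example.correctedOption"
  )

  // Rewrite any legacy key recorded by Spark 3.5.4 to its corrected
  // name, leaving every other entry untouched.
  def migrate(confs: Map[String, String]): Map[String, String] =
    confs.map { case (key, value) =>
      renamedKeys.getOrElse(key, key) -> value
    }
}

Applied to a conf map as it might be restored from a checkpoint written by
3.5.4:

val restored = Map(
  "spark.databricks.sql.example.leakedOption" -> "true",
  "spark.sql.shuffle.partitions" -> "200"
)
val migrated = ConfigMigration.migrate(restored)
// Map(spark.sql.example.correctedOption -> true,
//     spark.sql.shuffle.partitions -> 200)

With a rename like this in place, a query started on 3.5.4 can resume on
4.0.0+ directly; without it, the intermediate upgrade to 3.5.5+ is
required.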