Hi Jungtaek, replies below.
On Mon, Mar 10, 2025 at 9:02 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>
> Replied inline
>
> On Tue, Mar 11, 2025 at 10:39 AM Andrew Melo <andrew.m...@gmail.com> wrote:
>>
>> Hi Jungtaek,
>>
>> I've read the discussion, which is why I replied with my questions
>> (which you neglected to answer). Your deflection and lack of response
>> to direct questions should be (IMO) disqualifying. So, again:
>>
>> To put it in less complicated words - presumably the people using
>> the databricks.* configs were already using the Databricks Runtime.
>> Why does Apache Spark need to carry an extra migration patch when the
>> users who would be affected are already using the Databricks fork? I
>> don't see a situation where:
>>
>> A) legacy 3.5.x queries were using Databricks-specific options,
>> B) these users want to run the same queries in OSS Spark today, and
>> C) the same people will not be using the Databricks fork.
>
> I don't understand why you couple this with Databricks Runtime.
>
> If I were pushing this in a case where the config had only been released in
> Databricks Runtime and not yet in Apache Spark, you would definitely be
> right, although there are a bunch of other ways to solve the issue on the
> vendor side, and I wouldn't choose this one, which would be expected to
> carry a lot of burden.
>
> This config was released in "Apache" Spark 3.5.4, so this is NO LONGER just
> a problem with a vendor distribution. The breakage will happen even if
> someone does not know about Databricks Runtime at all and keeps using
> Apache Spark. It happens when users on Apache Spark 3.5.4 upgrade directly
> to Apache Spark 4.0.0+ (rather than upgrading first to Apache Spark
> 3.5.5+). (Again, from the vendor codebase it is not a problem at all to
> have such a config, and there is a much easier way to deal with it -
> migration logic is not needed at all.)
>
> I think this is a major misunderstanding that I see in this discussion:
> that this is pushing a vendor's code into OSS. Again, the migration logic
> is needed only by OSS. I can guarantee that I am not doing this for a
> vendor. (I made my first ASF contribution in 2014 and was a long-time PMC
> member of the Apache Storm project. I'm betting more than 10 years of
> contribution to OSS on this.)
>
> Technically speaking, I'm effectively wasting my time pushing the thing I
> think is the right way. I would be fine not pushing this hard, and it's
> not the best use of time from a vendor perspective. This is really just
> due diligence for the mistake, because I feel I need to do that.

This is code that will be carried for some time, and regardless of what you
responded to, I asked about which *users* would be affected. I think it's
fine to let your response speak for itself.

Thanks
Andrew

>
>>
>> Without a direct response to this, I think this discussion should be
>> considered to be just what it is on its face -- a solution to a vendor's
>> mistake -- and should not be ported to OSS Spark.
>>
>> Thanks
>> Andrew
>>
>> On Mon, Mar 10, 2025 at 8:34 PM Jungtaek Lim
>> <kabhwan.opensou...@gmail.com> wrote:
>> >
>> > Please read through the explanation of how this impacts OSS users in
>> > the other branch of this discussion. This happened in "Apache" Spark
>> > 3.5.4, and the migration logic has nothing to do with the vendor. It is
>> > primarily there to not break users on "Apache" Spark 3.5.4 who want to
>> > upgrade directly to "Apache" Spark 4.0.0+.
>> >
>> > I'm not implying anything about compatibility between OSS and vendors.
>> > I would remind you that the problematic config name is not problematic
>> > for the vendor, so all the vendor needs to do is alias the incorrect
>> > config name to the new config name, and it's done. The vendor does not
>> > need such migration code. This is just due diligence to mitigate the
>> > mistake we have made.
>> > Again, if the migration logic does not land in Apache Spark 4.0.x,
>> > only "Apache" Spark users will be impacted.
>> >
>> > On Tue, Mar 11, 2025 at 10:25 AM Andrew Melo <andrew.m...@gmail.com> wrote:
>> >>
>> >> Hello Jungtaek,
>> >>
>> >> I'm not implying that this improves the vendor's life. I'm just not
>> >> understanding the issue -- the downstream people started a stream with
>> >> a config option that the upstream people don't want to carry. If the
>> >> affected users are using the downstream fork (which is how they got
>> >> the option), then why can't those same downstream users keep using
>> >> their vendor-provided downstream fork?
>> >>
>> >> To put it in less complicated words - presumably the people using the
>> >> databricks.* configs were already using the Databricks Runtime. Why
>> >> does Apache Spark need to carry an extra migration patch when the
>> >> users who would be affected are already using the Databricks fork? I
>> >> don't see a situation where:
>> >>
>> >> A) legacy 3.5.x queries were using Databricks-specific options,
>> >> B) these users want to run the same queries in OSS Spark today, and
>> >> C) the same people will not be using the Databricks fork.
>> >>
>> >> This Venn diagram seems very small, and I don't think it justifies
>> >> carrying migration code for that one sliver of users.
>> >>
>> >> Andrew
>> >>
>> >> On Mon, Mar 10, 2025 at 5:29 PM Jungtaek Lim
>> >> <kabhwan.opensou...@gmail.com> wrote:
>> >> >
>> >> > One thing I can correct immediately is that downstream is not
>> >> > impacted by this at all. I believe I clarified that the config will
>> >> > not be modified by anyone, so downstream there is nothing to change.
>> >> > The problem is particular to OSS; downstream has no issue with this
>> >> > leak at all.
>> >> > (One thing to clarify: the config itself will be removed in Spark
>> >> > 4.0.0. We only propose to retain the migration logic.)
>> >> >
>> >> > I believe there is a huge misunderstanding - we are not proposing
>> >> > this migration logic to ease the vendor's life. If I didn't care
>> >> > about OSS, there would be no incentive for me to propose this.
>> >> >
>> >> > I just wanted to do my best to remove any burden from users who are
>> >> > innocent in this problem. If there is no migration logic, users will
>> >> > be forced to upgrade to Spark 3.5.5+ before upgrading to Spark
>> >> > 4.0.0+. Isn't that bad enough? Why should we let users be bitten
>> >> > when we can avoid it? The problem was introduced in the OSS
>> >> > community (I hope we don't blame ourselves for mistakes; we are
>> >> > human), so it is up to us to resolve it properly. We don't have the
>> >> > right to break users' queries.
>> >> >
>> >> > On Tue, Mar 11, 2025 at 7:13 AM Andrew Melo <andrew.m...@gmail.com> wrote:
>> >> >>
>> >> >> Hello all,
>> >> >>
>> >> >> As an outsider, I don't fully understand this discussion. This
>> >> >> particular configuration option "leaked" into the open-source Spark
>> >> >> distribution, and now there is a lot of discussion about how to
>> >> >> mitigate existing workloads. But: presumably the people who are
>> >> >> depending on this configuration flag are already using a downstream
>> >> >> (vendor-specific) fork, and a future update will similarly be
>> >> >> distributed by that downstream provider.
>> >> >>
>> >> >> Which people a) made a workflow using the vendor fork and b) want
>> >> >> to resume it in the OSS version of Spark?
>> >> >>
>> >> >> It seems like the people who are affected by this will already be
>> >> >> using someone else's fork, and there's no need to carry this patch
>> >> >> in the mainline Spark code.
>> >> >>
>> >> >> For that reason, I believe the code should be dropped by OSS Spark,
>> >> >> and vendors who need to mitigate it can push the appropriate
>> >> >> changes to their downstreams.
>> >> >>
>> >> >> Thanks
>> >> >> Andrew

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
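For readers following the thread: the kind of config-name migration being
debated above can be sketched roughly as follows. This is a hypothetical
illustration only, not the actual Spark patch; the function name
`migrate_confs` and the config keys `spark.databricks.example.leakedOption`
and `spark.sql.example.correctedOption` are placeholders standing in for the
leaked 3.5.4 name and its corrected 3.5.5+ name.

```python
# Placeholder names, not the real Spark config keys under discussion.
LEAKED_KEY = "spark.databricks.example.leakedOption"   # as written by 3.5.4
CORRECT_KEY = "spark.sql.example.correctedOption"      # name used in 3.5.5+

def migrate_confs(confs: dict) -> dict:
    """Return a copy of checkpointed confs with the leaked key renamed.

    If both keys are somehow present, the corrected name wins, so a
    checkpoint already touched by a fixed release is left unchanged.
    """
    migrated = dict(confs)
    if LEAKED_KEY in migrated:
        value = migrated.pop(LEAKED_KEY)   # drop the leaked name entirely
        migrated.setdefault(CORRECT_KEY, value)
    return migrated
```

The point of the sketch is that the rename happens once, when restoring state
written by the affected release, so a query can move from 3.5.4 straight to a
release that no longer recognizes the leaked name.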