Hi Jungtaek, replies below.
On Mon, Mar 10, 2025 at 9:02 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>
> Replied inline
>
> On Tue, Mar 11, 2025 at 10:39 AM Andrew Melo <andrew.m...@gmail.com> wrote:
>>
>> Hi Jungtaek,
>>
>> I've read the discussion, which is why I replied with my questions
>> (which you neglected to answer). Your deflection and lack of response
>> to direct questions should be (IMO) disqualifying. So, again:
>>
>> To put it in less complicated words - presumably the people using
>> the databricks.* configs were already using the Databricks Runtime.
>> Why does Apache Spark need to carry an extra migration patch when the
>> users who would be affected are already using the Databricks fork? I
>> don't see a situation where:
>>
>> A) legacy 3.5.x queries were using Databricks-specific options,
>> B) these users want to run the same queries in OSS Spark today, and
>> C) the same people will not be using the Databricks fork.
>
> I don't understand why you couple this with Databricks Runtime.
>
> If I were pushing this in a case where the config had only been released in
> Databricks Runtime and not yet in Apache Spark, you would definitely be
> right, although there are a bunch of other ways to solve the issue on the
> vendor side, and I wouldn't choose this one, which would be expected to
> carry a lot of burden.
>
> This config was released in "Apache" Spark 3.5.4, so this is NO LONGER just
> a problem with a vendor distribution. The breakage will happen even if
> someone does not know about Databricks Runtime at all and keeps using
> Apache Spark. It happens when users on Apache Spark 3.5.4 upgrade directly
> to Apache Spark 4.0.0+ (rather than upgrading first to Apache Spark
> 3.5.5+). (Again, from the vendor codebase it is not a problem at all to
> have such a config, and there is a much easier way to deal with it -
> migration logic is not needed at all.)
>
> I think this is a major misunderstanding that I see in this discussion:
> that this is pushing a vendor's code into OSS. Again, the migration logic
> is needed only by OSS. I can guarantee that I am not doing this for a
> vendor. (I made my first ASF contribution in 2014 and was a long-time PMC
> member of the Apache Storm project. I'm betting more than 10 years of
> contribution to OSS on this.)
>
> Technically speaking, I'm effectively wasting my time pushing the thing I
> think is the right way. I would be fine not pushing this hard, and it's
> not the best use of time from a vendor perspective. This is really just
> due diligence for the mistake, because I feel I need to do that.

This is code that will be carried for some time, and regardless of what you
responded to, I asked about which *users* would be affected. I think it's
fine to let your response speak for itself.

Thanks
Andrew

>
>>
>> Without a direct response to this, I think this discussion should be
>> considered to be just what it is on its face -- a solution to a vendor's
>> mistake -- and should not be ported to OSS Spark.
>>
>> Thanks
>> Andrew
>>
>> On Mon, Mar 10, 2025 at 8:34 PM Jungtaek Lim
>> <kabhwan.opensou...@gmail.com> wrote:
>> >
>> > Please read through the explanation of how this impacts OSS users in
>> > the other branch of this discussion. This happened in "Apache" Spark
>> > 3.5.4, and the migration logic has nothing to do with the vendor. It is
>> > primarily there to not break users on "Apache" Spark 3.5.4 who want to
>> > upgrade directly to "Apache" Spark 4.0.0+.
>> >
>> > I'm not implying anything about compatibility between OSS and vendors.
>> > I would remind you that the problematic config name is not problematic
>> > for the vendor, so all the vendor needs to do is alias the incorrect
>> > config name to the new config name, and it's done. The vendor does not
>> > need such migration code. This is just due diligence to mitigate the
>> > mistake we have made.
>> > Again, if the migration logic does not land in Apache Spark 4.0.x,
>> > only "Apache" Spark users will be impacted.
>> >
>> > On Tue, Mar 11, 2025 at 10:25 AM Andrew Melo <andrew.m...@gmail.com> wrote:
>> >>
>> >> Hello Jungtaek,
>> >>
>> >> I'm not implying that this improves the vendor's life. I'm just not
>> >> understanding the issue -- the downstream people started a stream with
>> >> a config option that the upstream people don't want to carry. If the
>> >> affected users are using the downstream fork (which is how they got
>> >> the option), then why can't those same downstream users keep using
>> >> their vendor-provided downstream fork?
>> >>
>> >> To put it in less complicated words - presumably the people using the
>> >> databricks.* configs were already using the Databricks Runtime. Why
>> >> does Apache Spark need to carry an extra migration patch when the
>> >> users who would be affected are already using the Databricks fork? I
>> >> don't see a situation where:
>> >>
>> >> A) legacy 3.5.x queries were using Databricks-specific options,
>> >> B) these users want to run the same queries in OSS Spark today, and
>> >> C) the same people will not be using the Databricks fork.
>> >>
>> >> This Venn diagram seems very small, and I don't think it justifies
>> >> carrying migration code for that one sliver of users.
>> >>
>> >> Andrew
>> >>
>> >> On Mon, Mar 10, 2025 at 5:29 PM Jungtaek Lim
>> >> <kabhwan.opensou...@gmail.com> wrote:
>> >> >
>> >> > One thing I can correct immediately is that downstream is not
>> >> > impacted by this at all. I believe I clarified that the config will
>> >> > not be modified by anyone, so downstream there is nothing to change.
>> >> > The problem is particular to OSS; downstream has no issue with this
>> >> > leak at all.
>> >> > (One thing to clarify: the config itself will be removed in Spark
>> >> > 4.0.0. We only propose to retain the migration logic.)
>> >> >
>> >> > I believe there is a huge misunderstanding - we are not proposing
>> >> > this migration logic to ease the vendor's life. If I didn't care
>> >> > about OSS, there would be no incentive for me to propose this.
>> >> >
>> >> > I just wanted to do my best to remove any burden from users who are
>> >> > innocent in this problem. If there is no migration logic, users will
>> >> > be forced to upgrade to Spark 3.5.5+ before upgrading to Spark
>> >> > 4.0.0+. Isn't that bad enough? Why should we let users be bitten
>> >> > when we can avoid it? The problem was introduced in the OSS
>> >> > community (I hope we don't blame ourselves for mistakes; we are
>> >> > human), so it is up to us to resolve it properly. We don't have the
>> >> > right to break users' queries.
>> >> >
>> >> > On Tue, Mar 11, 2025 at 7:13 AM Andrew Melo <andrew.m...@gmail.com> wrote:
>> >> >>
>> >> >> Hello all,
>> >> >>
>> >> >> As an outsider, I don't fully understand this discussion. This
>> >> >> particular configuration option "leaked" into the open-source Spark
>> >> >> distribution, and now there is a lot of discussion about how to
>> >> >> mitigate existing workloads. But: presumably the people who are
>> >> >> depending on this configuration flag are already using a downstream
>> >> >> (vendor-specific) fork, and a future update will similarly be
>> >> >> distributed by that downstream provider.
>> >> >>
>> >> >> Which people a) made a workflow using the vendor fork and b) want
>> >> >> to resume it in the OSS version of Spark?
>> >> >>
>> >> >> It seems like the people who are affected by this will already be
>> >> >> using someone else's fork, and there's no need to carry this patch
>> >> >> in the mainline Spark code.
>> >> >>
>> >> >> For that reason, I believe the code should be dropped by OSS Spark,
>> >> >> and vendors who need to mitigate it can push the appropriate
>> >> >> changes to their downstreams.
>> >> >>
>> >> >> Thanks
>> >> >> Andrew

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
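For readers following the thread: the kind of config-name migration being
debated above can be sketched roughly as follows. This is a hypothetical
illustration only, not the actual Spark patch; the function name
`migrate_confs` and the config keys `spark.databricks.example.leakedOption`
and `spark.sql.example.correctedOption` are placeholders standing in for the
leaked 3.5.4 name and its corrected 3.5.5+ name.

```python
# Placeholder names, not the real Spark config keys under discussion.
LEAKED_KEY = "spark.databricks.example.leakedOption"   # as written by 3.5.4
CORRECT_KEY = "spark.sql.example.correctedOption"      # name used in 3.5.5+

def migrate_confs(confs: dict) -> dict:
    """Return a copy of checkpointed confs with the leaked key renamed.

    If both keys are somehow present, the corrected name wins, so a
    checkpoint already touched by a fixed release is left unchanged.
    """
    migrated = dict(confs)
    if LEAKED_KEY in migrated:
        value = migrated.pop(LEAKED_KEY)   # drop the leaked name entirely
        migrated.setdefault(CORRECT_KEY, value)
    return migrated
```

The point of the sketch is that the rename happens once, when restoring state
written by the affected release, so a query can move from 3.5.4 straight to a
release that no longer recognizes the leaked name.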