That said, if you guys understand the proposal better and have a preference on one side, could you please participate in the VOTE thread? https://lists.apache.org/thread/nm3p1zjcybdl0p0mc56t2rl92hb9837n
Specifically on this topic, I do think the input from users is very important, especially if you are maintaining streaming queries in your daily job. Thanks!

On Tue, Mar 11, 2025 at 12:18 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:

> Thanks for looking into the issue in depth. What you described is right.
>
> I also understand the concern about why we keep the buggy behavior, but the QO issue is quite complicated, and the most concerning part is that it is "selective". If a query runs with the QO deciding "one way" for its whole lifecycle, it never hits the issue, despite going through the "fragile" path. If we don't provide backward compatibility and simply enforce the fix, existing queries are at risk of immediate failure on upgrade. I understand there are arguments about what the right path is, but the decision was made to not break existing queries with this fix. That's it.
>
> So the target workloads to help are all streaming queries that were ever started on Apache Spark 3.5.4. To keep these workloads from re-exposing the nasty bug, we just need four lines of code. I understand someone may be concerned about ever having a vendor name in the codebase, but as Wenchen said, it has been there before, and no one has cited the exact ASF policy on this. If ASF policy clearly stated that the codebase must not contain vendor names, there would be no argument. I do wonder whether that is realistic at all, though: by the same logic, I could argue that we must change the code comment mentioning "Apple" Silicon to just "Silicon" or "M series" because it contains a vendor name. How about Meta? Should we prohibit "Metadata" from being referenced because it contains "Meta"? Is that really making everyone happy?
>
> On Tue, Mar 11, 2025 at 11:12 AM Adam Binford <adam...@gmail.com> wrote:
>
>> I was very confused about this as well, but I think I understand it more after reading through the PRs.
>> Jungtaek, let me know if this is correct; maybe it will help others understand.
>>
>> There was a bug where streaming queries could prune parts of the query plan that might have side effects, like stateful operations or metric collection. This bug was fixed in OSS Spark 3.5.4 with a new config that was accidentally added under "spark.databricks.sql". The config defaults to false, which is the value that "fixes" the bug. The config was added to maintain the current behavior for existing structured streams. So for any stream started on 3.5.3 or earlier, when it is restarted on 3.5.4, the offset metadata log records a value of "true" for the "spark.databricks.sql" config, so as to keep the existing buggy behavior for existing streams. Any new stream started on 3.5.4 has the value set to "false" in the offset metadata log, because that is the default value for the config. In 3.5.5 the config name was corrected to remove "databricks".
>>
>> I'm not totally sure of the reason for keeping the buggy behavior for existing streams. It seems like you would eventually have to restart those streams from scratch anyway if you happen to hit the edge case where the bug causes problems.
>> The case that keeping the handling of the incorrect config name is trying to cover is:
>> - A new stream was started on 3.5.4 with the incorrect config name set to "false" in the metadata log, fixing the bug
>> - That stream is restarted from the existing checkpoint on 4.0.0+
>> - Without the incorrect-config handling logic, the restarted stream would revert to the "buggy" behavior until it is restarted from a fresh checkpoint
>> - With the incorrect-config handling, a stream restarted from an existing checkpoint on 4.0.0 keeps the bug fix while the config name in the metadata log is corrected
>>
>> Without the special handling of the incorrect config name, for all intents and purposes this bug was fixed in 3.5.5 and later instead of 3.5.4, so it doesn't seem like a huge deal either way if the code is briefly maintained to help this specific edge case. Any stream started on 3.5.3 would need to be restarted on 3.5.4, or on 3.5.5 and later, to get the bug fix; but if this bug is causing breaking problems for your query, you would probably need to restart it from scratch anyway.
>>
>> At the end of the day there's probably no reason not to include four lines of code if it makes a few people's lives easier; I'm not sure how many people are affected by this bug.
>>
>> Adam
>>
>> On Mon, Mar 10, 2025 at 10:02 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>>
>>> Replied inline
>>>
>>> On Tue, Mar 11, 2025 at 10:39 AM Andrew Melo <andrew.m...@gmail.com> wrote:
>>>
>>>> Hi Jungtaek,
>>>>
>>>> I've read the discussion, which is why I replied with my questions (which you neglected to answer). Your deflection and lack of response to direct questions should be (IMO) disqualifying. So, again:
>>>>
>>>> To put it into less complicated words: presumably the people using the databricks.* configs were already using the Databricks runtime.
>>>> Why does Apache Spark need to carry an extra migration patch when the users who would be affected are already using the Databricks fork? I don't see a situation where:
>>>>
>>>> A) legacy 3.5.x queries were using Databricks-specific options
>>>> B) these users want to run the same queries in OSS Spark today
>>>> C) the same people will not be using the Databricks fork.
>>>
>>> I don't understand why you couple this with Databricks Runtime.
>>>
>>> If this were a case where the config had only been released in Databricks Runtime and not yet in Apache Spark, you would definitely be right (although there are a bunch of other ways to solve the issue on the vendor side, and I wouldn't choose this approach, which would be expected to carry a lot of burden).
>>>
>>> This config was released in "Apache" Spark 3.5.4, so it is NO LONGER just a problem of the vendor distribution. The breakage will happen even for someone who does not know about Databricks Runtime at all and keeps using Apache Spark: it happens when users on Apache Spark 3.5.4 upgrade directly to Apache Spark 4.0.0+ (rather than upgrading first to Apache Spark 3.5.5+).
>>> (Again, in the vendor codebase it is not a problem at all to have such a config, and there is a much easier way to deal with it; migration logic is not needed at all.)
>>>
>>> I think a major misunderstanding in this discussion is that this is seen as pushing a vendor's code into OSS. Again, the migration logic is needed only by OSS. I can guarantee that I am not doing this for a vendor. (I made my first ASF contribution in 2014 and was a long-time PMC member of the Apache Storm project. I'm betting more than 10 years of contribution to OSS on this.)
>>>
>>> Technically speaking, I'm effectively wasting my time pushing what I think is the right way.
>>> It would be fine for me not to push this hard, and it's not the best use of my time from a vendor perspective. This is really just due diligence on the mistake, because I feel I need to do it.
>>>
>>>> Without a direct response to this, I think this discussion should be considered to be just what it is on its face: a solution to a vendor's mistake, and it should not be ported to OSS Spark.
>>>>
>>>> Thanks
>>>> Andrew
>>>>
>>>> On Mon, Mar 10, 2025 at 8:34 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>>>> >
>>>> > Please read through the explanation of how this impacts OSS users in the other branch of this discussion. This happened in "Apache" Spark 3.5.4, and the migration logic has nothing to do with the vendor. It is primarily there to not break users on "Apache" Spark 3.5.4 who want to upgrade directly to "Apache" Spark 4.0.0+.
>>>> >
>>>> > I'm not implying anything about compatibility between OSS and vendors. I would remind you that the problematic config name is not problematic for the vendor: all the vendor needs to do is alias the incorrect config name to the new one, and it's done. The vendor does not need this migration code. This is just due diligence in mitigating the mistake we have made. Again, if the migration logic does not land in Apache Spark 4.0.x, only "Apache" Spark users will be impacted.
>>>> >
>>>> > On Tue, Mar 11, 2025 at 10:25 AM Andrew Melo <andrew.m...@gmail.com> wrote:
>>>> >>
>>>> >> Hello Jungtaek,
>>>> >>
>>>> >> I'm not implying that this improves the vendor's life. I'm just not understanding the issue: the downstream people started a stream with a config option that the upstream people don't want to carry. If the affected users are using the downstream fork (which is how they got the option), then why can't those same downstream users keep using their vendor-provided downstream fork?
>>>> >> To put it into less complicated words: presumably the people using the databricks.* configs were already using the Databricks runtime. Why does Apache Spark need to carry an extra migration patch when the users who would be affected are already using the Databricks fork? I don't see a situation where:
>>>> >>
>>>> >> A) legacy 3.5.x queries were using Databricks-specific options
>>>> >> B) these users want to run the same queries in OSS Spark today
>>>> >> C) the same people will not be using the Databricks fork.
>>>> >>
>>>> >> This Venn diagram seems very small, and I don't think it justifies carrying migration code for that one sliver of users.
>>>> >>
>>>> >> Andrew
>>>> >>
>>>> >> On Mon, Mar 10, 2025 at 5:29 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>>>> >> >
>>>> >> > One thing I can correct immediately: downstream is not impacted by this at all. I believe I clarified that the config will not be modified by anyone, so downstream there is nothing to change. The problem is specific to OSS; downstream has no issue with this leak at all.
>>>> >> > (One thing to clarify: the config itself will be removed in Spark 4.0.0. We only propose to retain the migration logic.)
>>>> >> >
>>>> >> > I believe there is a huge misunderstanding: we are not proposing this migration logic to ease the vendor's life. If I didn't care about OSS, there would be no incentive for me to propose this.
>>>> >> >
>>>> >> > I just wanted to do my best to remove any burden from users who are innocent in this problem. If there is no migration logic, users will be forced to upgrade to Spark 3.5.5+ before upgrading to Spark 4.0.0+. Isn't that bad enough? Why should we let users be bothered when we can avoid it? The problem was introduced in the OSS community (I hope we don't blame ourselves for mistakes; we are human),
>>>> >> > so it is up to us to resolve it properly. We don't have the right to break users' queries.
>>>> >> >
>>>> >> > On Tue, Mar 11, 2025 at 7:13 AM Andrew Melo <andrew.m...@gmail.com> wrote:
>>>> >> >>
>>>> >> >> Hello all
>>>> >> >>
>>>> >> >> As an outsider, I don't fully understand this discussion. This particular configuration option "leaked" into the open-source Spark distribution, and now there is a lot of discussion about how to mitigate existing workloads. But presumably the people who are depending on this configuration flag are already using a downstream (vendor-specific) fork, and a future update will similarly be distributed by that downstream provider.
>>>> >> >>
>>>> >> >> Which people a) made a workflow using the vendor fork and b) want to resume it in the OSS version of Spark?
>>>> >> >>
>>>> >> >> It seems like the people who are affected by this will already be using someone else's fork, and there's no need to carry this patch in the mainline Spark code.
>>>> >> >>
>>>> >> >> For that reason, I believe the code should be dropped from OSS Spark, and vendors who need to mitigate it can push the appropriate changes to their downstreams.
>>>> >> >>
>>>> >> >> Thanks
>>>> >> >> Andrew
>>
>> --
>> Adam Binford
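[Editor's note] For readers trying to picture the "four lines" of migration logic the thread keeps referring to, here is a minimal sketch in Python. It is an assumption-laden illustration, not Spark's actual code: the config names are paraphrased, the function name is invented, and the real logic lives in Structured Streaming's Scala offset-metadata handling.

```python
# Illustrative sketch only; names below are assumptions, not Spark's real identifiers.

# Config name that accidentally shipped in Apache Spark 3.5.4 (assumed form).
INCORRECT_NAME = "spark.databricks.sql.optimizer.pruneFiltersCanPruneStreamingSubplan"
# Corrected name used from Apache Spark 3.5.5 onward (assumed form).
CORRECT_NAME = "spark.sql.optimizer.pruneFiltersCanPruneStreamingSubplan"


def migrate_offset_log_confs(confs: dict) -> dict:
    """When restoring session confs from a checkpoint's offset metadata log,
    carry any value recorded under the incorrect 3.5.4 name over to the
    corrected name, so a stream started on 3.5.4 keeps the behavior it was
    started with when restarted on 4.0.0+."""
    if INCORRECT_NAME in confs:
        migrated = dict(confs)
        # Move the recorded value to the corrected key and drop the old one.
        migrated[CORRECT_NAME] = migrated.pop(INCORRECT_NAME)
        return migrated
    return confs
```

The point of the sketch is simply that a value recorded under the incorrect name in a 3.5.4 checkpoint is remapped to the corrected name on restart, so the restarted stream neither reverts to the buggy behavior nor fails on upgrade.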