That said, if you guys understand the proposal better and have a preference on one side, could you please participate in the VOTE thread? https://lists.apache.org/thread/nm3p1zjcybdl0p0mc56t2rl92hb9837n
Specifically on this topic, I do think the input from users is very important, especially if you are maintaining streaming queries in your daily job. Thanks!

On Tue, Mar 11, 2025 at 12:18 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:

> Thanks for looking into the issue in depth. What you described is right.
>
> I also understand the concern about why we keep the buggy behavior, but the QO issue is quite complicated, and the most concerning part is that it is "selective". If a query runs with the QO deciding "one way" for its whole lifecycle, it never hits the issue, despite going through the "fragile" path. If we don't provide backward compatibility and simply enforce the fix, existing queries are at risk of immediate failure on upgrade. I understand there are arguments about what the right path is, but the decision was made to not break existing queries with this fix. That's it.
>
> So the target workloads to help are all streaming queries that were ever started on Apache Spark 3.5.4. To keep these workloads from re-exposing the nasty bug, we just need four lines of code. I understand someone may be concerned about ever having a vendor name in the codebase, but as Wenchen said, it has been there before, and no one has cited the exact ASF policy on this. If ASF policy clearly stated that the codebase must not contain vendor names, there would be no argument. I do wonder whether that is realistic at all, though: by the same logic, I could argue that we must change the code comment mentioning "Apple" Silicon to just "Silicon" or "M series" because it contains a vendor name. How about Meta? Should we prohibit "Metadata" from being referenced because it contains "Meta"? Is that really making everyone happy?
>
> On Tue, Mar 11, 2025 at 11:12 AM Adam Binford <adam...@gmail.com> wrote:
>
>> I was very confused about this as well, but I think I understand it more after reading through the PRs.
>> Jungtaek, let me know if this is correct; maybe it will help others understand.
>>
>> There was a bug where streaming queries could prune parts of the query plan that might have side effects, like stateful operations or metric collection. This bug was fixed in OSS Spark 3.5.4 with a new config that was accidentally added under "spark.databricks.sql". The config defaults to false, which is the value that "fixes" the bug. The config was added to maintain the current behavior for existing structured streams. So for any stream started on 3.5.3 or earlier, when it is restarted on 3.5.4, the offset metadata log records a value of "true" for the "spark.databricks.sql" config, so as to keep the existing buggy behavior for existing streams. Any new stream started on 3.5.4 has the value set to "false" in the offset metadata log, because that is the default value for the config. In 3.5.5 the config name was corrected to remove "databricks".
>>
>> I'm not totally sure of the reason for keeping the buggy behavior for existing streams. It seems like you would eventually have to restart those streams from scratch anyway if you happen to hit the edge case where the bug causes problems.
>> The case that keeping the handling of the incorrect config name is trying to cover is:
>> - A new stream was started on 3.5.4 with the incorrect config name set to "false" in the metadata log, fixing the bug
>> - That stream is restarted from the existing checkpoint on 4.0.0+
>> - Without the incorrect-config handling logic, the restarted stream would revert to the "buggy" behavior until it is restarted from a fresh checkpoint
>> - With the incorrect-config handling, a stream restarted from an existing checkpoint on 4.0.0 keeps the bug fix while the config name in the metadata log is corrected
>>
>> Without the special handling of the incorrect config name, for all intents and purposes this bug was fixed in 3.5.5 and later instead of 3.5.4, so it doesn't seem like a huge deal either way if the code is briefly maintained to help this specific edge case. Any stream started on 3.5.3 would need to be restarted on 3.5.4, or on 3.5.5 and later, to get the bug fix; but if this bug is causing breaking problems for your query, you would probably need to restart it from scratch anyway.
>>
>> At the end of the day there's probably no reason not to include four lines of code if it makes a few people's lives easier; I'm not sure how many people are affected by this bug.
>>
>> Adam
>>
>> On Mon, Mar 10, 2025 at 10:02 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>>
>>> Replied inline
>>>
>>> On Tue, Mar 11, 2025 at 10:39 AM Andrew Melo <andrew.m...@gmail.com> wrote:
>>>
>>>> Hi Jungtaek,
>>>>
>>>> I've read the discussion, which is why I replied with my questions (which you neglected to answer). Your deflection and lack of response to direct questions should be (IMO) disqualifying. So, again:
>>>>
>>>> To put it into less complicated words: presumably the people using the databricks.* configs were already using the Databricks runtime.
>>>> Why does Apache Spark need to carry an extra migration patch when the users who would be affected are already using the Databricks fork? I don't see a situation where:
>>>>
>>>> A) legacy 3.5.x queries were using Databricks-specific options
>>>> B) these users want to run the same queries in OSS Spark today
>>>> C) the same people will not be using the Databricks fork.
>>>
>>> I don't understand why you couple this with Databricks Runtime.
>>>
>>> If this were a case where the config had only been released in Databricks Runtime and not yet in Apache Spark, you would definitely be right (although there are a bunch of other ways to solve the issue on the vendor side, and I wouldn't choose this approach, which would be expected to carry a lot of burden).
>>>
>>> This config was released in "Apache" Spark 3.5.4, so it is NO LONGER just a problem of the vendor distribution. The breakage will happen even for someone who does not know about Databricks Runtime at all and keeps using Apache Spark: it happens when users on Apache Spark 3.5.4 upgrade directly to Apache Spark 4.0.0+ (rather than upgrading first to Apache Spark 3.5.5+).
>>> (Again, in the vendor codebase it is not a problem at all to have such a config, and there is a much easier way to deal with it; migration logic is not needed at all.)
>>>
>>> I think a major misunderstanding in this discussion is that this is seen as pushing a vendor's code into OSS. Again, the migration logic is needed only by OSS. I can guarantee that I am not doing this for a vendor. (I made my first ASF contribution in 2014 and was a long-time PMC member of the Apache Storm project. I'm betting more than 10 years of contribution to OSS on this.)
>>>
>>> Technically speaking, I'm effectively wasting my time pushing what I think is the right way.
>>> It would be fine for me not to push this hard, and it's not the best use of my time from a vendor perspective. This is really just due diligence on the mistake, because I feel I need to do it.
>>>
>>>> Without a direct response to this, I think this discussion should be considered to be just what it is on its face: a solution to a vendor's mistake, and it should not be ported to OSS Spark.
>>>>
>>>> Thanks
>>>> Andrew
>>>>
>>>> On Mon, Mar 10, 2025 at 8:34 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>>>> >
>>>> > Please read through the explanation of how this impacts OSS users in the other branch of this discussion. This happened in "Apache" Spark 3.5.4, and the migration logic has nothing to do with the vendor. It is primarily there to not break users on "Apache" Spark 3.5.4 who want to upgrade directly to "Apache" Spark 4.0.0+.
>>>> >
>>>> > I'm not implying anything about compatibility between OSS and vendors. I would remind you that the problematic config name is not problematic for the vendor: all the vendor needs to do is alias the incorrect config name to the new one, and it's done. The vendor does not need this migration code. This is just due diligence in mitigating the mistake we have made. Again, if the migration logic does not land in Apache Spark 4.0.x, only "Apache" Spark users will be impacted.
>>>> >
>>>> > On Tue, Mar 11, 2025 at 10:25 AM Andrew Melo <andrew.m...@gmail.com> wrote:
>>>> >>
>>>> >> Hello Jungtaek,
>>>> >>
>>>> >> I'm not implying that this improves the vendor's life. I'm just not understanding the issue: the downstream people started a stream with a config option that the upstream people don't want to carry. If the affected users are using the downstream fork (which is how they got the option), then why can't those same downstream users keep using their vendor-provided downstream fork?
>>>> >> To put it into less complicated words: presumably the people using the databricks.* configs were already using the Databricks runtime. Why does Apache Spark need to carry an extra migration patch when the users who would be affected are already using the Databricks fork? I don't see a situation where:
>>>> >>
>>>> >> A) legacy 3.5.x queries were using Databricks-specific options
>>>> >> B) these users want to run the same queries in OSS Spark today
>>>> >> C) the same people will not be using the Databricks fork.
>>>> >>
>>>> >> This Venn diagram seems very small, and I don't think it justifies carrying migration code for that one sliver of users.
>>>> >>
>>>> >> Andrew
>>>> >>
>>>> >> On Mon, Mar 10, 2025 at 5:29 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>>>> >> >
>>>> >> > One thing I can correct immediately: downstream is not impacted by this at all. I believe I clarified that the config will not be modified by anyone, so downstream there is nothing to change. The problem is specific to OSS; downstream has no issue with this leak at all.
>>>> >> > (One thing to clarify: the config itself will be removed in Spark 4.0.0. We only propose to retain the migration logic.)
>>>> >> >
>>>> >> > I believe there is a huge misunderstanding: we are not proposing this migration logic to ease the vendor's life. If I didn't care about OSS, there would be no incentive for me to propose this.
>>>> >> >
>>>> >> > I just wanted to do my best to remove any burden from users who are innocent in this problem. If there is no migration logic, users will be forced to upgrade to Spark 3.5.5+ before upgrading to Spark 4.0.0+. Isn't that bad enough? Why should we let users be bothered when we can avoid it? The problem was introduced in the OSS community (I hope we don't blame ourselves for mistakes; we are human),
>>>> >> > so it is up to us to resolve it properly. We don't have the right to break users' queries.
>>>> >> >
>>>> >> > On Tue, Mar 11, 2025 at 7:13 AM Andrew Melo <andrew.m...@gmail.com> wrote:
>>>> >> >>
>>>> >> >> Hello all
>>>> >> >>
>>>> >> >> As an outsider, I don't fully understand this discussion. This particular configuration option "leaked" into the open-source Spark distribution, and now there is a lot of discussion about how to mitigate existing workloads. But presumably the people who are depending on this configuration flag are already using a downstream (vendor-specific) fork, and a future update will similarly be distributed by that downstream provider.
>>>> >> >>
>>>> >> >> Which people a) made a workflow using the vendor fork and b) want to resume it in the OSS version of Spark?
>>>> >> >>
>>>> >> >> It seems like the people who are affected by this will already be using someone else's fork, and there's no need to carry this patch in the mainline Spark code.
>>>> >> >>
>>>> >> >> For that reason, I believe the code should be dropped from OSS Spark, and vendors who need to mitigate it can push the appropriate changes to their downstreams.
>>>> >> >>
>>>> >> >> Thanks
>>>> >> >> Andrew
>>
>> --
>> Adam Binford
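[Editor's note] For readers trying to picture the "four lines" of migration logic the thread keeps referring to, here is a minimal sketch in Python. It is an assumption-laden illustration, not Spark's actual code: the config names are paraphrased, the function name is invented, and the real logic lives in Structured Streaming's Scala offset-metadata handling.

```python
# Illustrative sketch only; names below are assumptions, not Spark's real identifiers.

# Config name that accidentally shipped in Apache Spark 3.5.4 (assumed form).
INCORRECT_NAME = "spark.databricks.sql.optimizer.pruneFiltersCanPruneStreamingSubplan"
# Corrected name used from Apache Spark 3.5.5 onward (assumed form).
CORRECT_NAME = "spark.sql.optimizer.pruneFiltersCanPruneStreamingSubplan"


def migrate_offset_log_confs(confs: dict) -> dict:
    """When restoring session confs from a checkpoint's offset metadata log,
    carry any value recorded under the incorrect 3.5.4 name over to the
    corrected name, so a stream started on 3.5.4 keeps the behavior it was
    started with when restarted on 4.0.0+."""
    if INCORRECT_NAME in confs:
        migrated = dict(confs)
        # Move the recorded value to the corrected key and drop the old one.
        migrated[CORRECT_NAME] = migrated.pop(INCORRECT_NAME)
        return migrated
    return confs
```

The point of the sketch is simply that a value recorded under the incorrect name in a 3.5.4 checkpoint is remapped to the corrected name on restart, so the restarted stream neither reverts to the buggy behavior nor fails on upgrade.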