I think the open question is how to handle the deprecation and removal. If we keep the migration path through Spark 4.1.x, it will provide more than "1 year of upgrade path". Given our release cadence, Spark 4.2.0 would probably be released in March next year or later, and Spark 3.5.4 was released in December last year. Spark 4.0.x alone may not be very long, but it still provides 6+ months of upgrade path.
I'm not saying we should keep it forever. I'm saying we should try to reduce the probability of breakage, just as projects generally handle deprecation and removal while trying to minimize the impact. I see you are predicting the affected population to be small, but that doesn't mean we are free to do nothing and leave those users broken and dissatisfied with the project. I see this being compared with a "security fix" when we talk about severity, but a security fix does not restrict the upgrade path, so what we are about to do is much worse than that. I'm trying to make it a lot less bad. I'm doing my best to care about users. Upgrading is not just "one click", even for bugfix versions.

On Thu, Mar 6, 2025 at 1:56 AM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:

> Let me reformulate your suggestions and my interpretation.
>
> Option 1: "Add back `spark.databricks.*` to the Spark codebase and keep it forever."
>
> If we follow the proposed logic and reasoning, it means there is no safe version in which to remove that configuration, because Apache Spark 3.5.4 users can technically jump to any future release like Spark 4.1.0, 4.2.0, or 5.0.0. In other words, we can never remove that logic.
>
> That's the reason why we couldn't reach an agreement so far.
>
> Option 2 is simply adding a sentence (or a more accurate one) about Spark 3.5.4 to the Spark 4.0.0 migration guideline, because all other Spark versions (except 3.5.4) are not contaminated by the `spark.databricks.*` conf:
>
> "For Spark 3.5.4 streaming jobs, if you want to migrate existing running jobs, you need to upgrade them to Spark 3.5.5+ before upgrading to Spark 4.0."
>
> Dongjoon.
>
>
> On Tue, Mar 4, 2025 at 11:11 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>
>> Let's not start a VOTE right now, but let me make the options and their pros/cons clear, so that people can choose one over the other.
>>
>> Option 1 (current proposal): retain the migration logic for Spark 4.0 (and maybe more minor versions, up for decision), which contains the problematic config only as a "string".
>>
>> Pros: We avoid breaking users' queries on any upgrade path, as long as we retain the migration logic. For example, a streaming query which ever ran on Spark 3.5.4 can be upgraded to Spark 4.0.x as long as we retain the migration logic in Spark 4.0, and to Spark 4.1.x, Spark 4.2.x, etc. as long as we retain the migration path longer.
>> Cons: We retain the concerned config name in the codebase, though it's a string and users can never set it.
>>
>> Option 2 (Dongjoon's proposal): do not ship the migration logic in Spark 4.0, and force users to run an existing streaming query on Spark 3.5.5+ before upgrading to Spark 4.0.0+.
>>
>> Pros: We stop retaining the concerned config name in the codebase.
>> Cons: Upgrading directly from Spark 3.5.4 to Spark 4.0+ will miss the critical QO fix, which can lead to a "broken" checkpoint. If the checkpoint is broken, there is no way to restore it, and users have to restart the query from scratch. Since the target workload is stateful, in the worst case the query has to reprocess from the earliest data.
>>
>> I would only agree about the severity if an ASF project had a precedent where a vendor name ended up in the codebase and it was decided to pay whatever cost to fix it. I'm happy to be corrected if there is an ASF document explicitly covering such a case and its action items.
>>
>> On Wed, Mar 5, 2025 at 3:51 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>>
>>> Shall we open an official vote for it? We can put more details in it so that people can vote:
>>> 1. How does it break user workloads without this migration code?
>>> 2. What is the Apache policy for leaked vendor names in the codebase? I think this is not the only one; we also mention `com.databricks.spark.csv` in
>>> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L621C8-L621C32
>>>
>>> On Wed, Mar 5, 2025 at 2:40 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>>>
>>>> One major question: how do you believe we can enforce an upgrade path on users? I have seen a bunch of cases where users jump 2-3 minor versions at once. Do you really believe we can just break their queries? What data backs up your claim?
>>>>
>>>> I think we agree to disagree. I really don't want "users" to get into this situation just because of us. This is regardless of who made the mistake - it's about what the proper mitigation is, and I do not believe forcing users to upgrade to Spark 3.5.8+ before upgrading to Spark 4.0 is a valid approach.
>>>>
>>>> If I were to vote on your alternative option, I'm -1 on it.
>>>>
>>>> On Wed, Mar 5, 2025 at 3:29 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>>
>>>>> Technically, there is no agreement here. In other words, we are in the same situation as the initial discussion thread, where we couldn't build community consensus on this.
>>>>>
>>>>> > I will consider this as "lazy consensus" if there are no objections
>>>>> > for 3 days from initiation of the thread.
>>>>>
>>>>> If you need an explicit veto, here is mine, -1, because I don't think that's just a string.
>>>>>
>>>>> > the problematic config is just a "string",
>>>>>
>>>>> To be clear, as I proposed both in the PR comments and the initial discussion thread, I believe we had better keep `master` and `branch-4.0` as-is and recommend upgrading to the latest version of Apache Spark 3.5.x first before upgrading to Spark 4.
>>>>>
>>>>> Sincerely,
>>>>> Dongjoon.
>>>>>
>>>>>
>>>>> On Tue, Mar 4, 2025 at 8:37 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>>>>>
>>>>>> Bumping this. Again, this is a blocker for Spark 4.0.0. I will consider this as "lazy consensus" if there are no objections for 3 days from the initiation of the thread.
>>>>>>
>>>>>> On Tue, Mar 4, 2025 at 2:15 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi dev,
>>>>>>>
>>>>>>> This is a spin-off of the original thread "Deprecating and banning `spark.databricks.*` config from Apache Spark repository" (link <https://lists.apache.org/thread/qwxb21g5xjl7xfp4rozqmg1g0ndfw2jd>).
>>>>>>>
>>>>>>> In the original thread, we decided to deprecate the config in Spark 3.5.5 and remove it in Spark 4.0.0. One thing that thread did not decide is the smooth migration logic.
>>>>>>>
>>>>>>> We "persist" the config into the offset log for streaming queries, since the value of the config must be consistent across the lifecycle of the query. This means the problematic config is already persisted for any streaming query that ever ran with Spark 3.5.4.
>>>>>>>
>>>>>>> For the migration logic, we re-assign the value of the problematic config to the new config.
>>>>>>> This happens when the query is restarted, and it is reflected in the offset log for newer batches, so after a couple of new microbatches the migration logic is no longer needed. The migration logic shipped in Spark 3.5.5, so once the query has run on Spark 3.5.5 for a couple of microbatches, the problem is mitigated.
>>>>>>>
>>>>>>> But I would say there will always be cases where users bump the minor/major version without stepping through every bugfix version. I think it is still dangerous to remove the migration logic in Spark 4.0.0 (and probably Spark 4.1.0, depending on the discussion). In the migration logic, the problematic config is just a "string", and users are never able to set a value under the problematic config name. We don't document this, as it is done automatically.
>>>>>>>
>>>>>>> That said, I'd propose keeping the migration logic for the Spark 4.0 version line (at minimum; 4.1 is debatable). This gives users a safer and less burdensome migration path at the cost of retaining only a problematic "string" (again, not a config).
>>>>>>>
>>>>>>> I'd love to hear the community's voice on this. As a reminder, this is a blocker for Spark 4.0.0.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Jungtaek Lim (HeartSaVioR)
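
For concreteness, the migration logic described in the original proposal amounts to re-keying a single entry in the conf map that the offset log persists. Here is a minimal sketch of that idea in Scala; the object name, the key names, and the plain Map representation are illustrative assumptions for this thread, not the actual Spark code paths:

    // Minimal sketch of the offset-log conf re-keying; not the actual Spark code.
    // The key names below are hypothetical stand-ins for the leaked and proper names.
    object StreamingConfMigration {
      private val OldKey = "spark.databricks.sql.someStreamingConf" // leaked name (hypothetical)
      private val NewKey = "spark.sql.someStreamingConf"            // proper name (hypothetical)

      // Applied when the query restarts and replays the persisted offset log.
      // Newer microbatches persist only NewKey, so after a couple of batches
      // the old string never appears in the offset log again.
      def migrate(persistedConf: Map[String, String]): Map[String, String] =
        persistedConf.get(OldKey) match {
          case Some(value) => persistedConf - OldKey + (NewKey -> value)
          case None        => persistedConf
        }
    }

This is also why running the query on Spark 3.5.5 for just a couple of microbatches is sufficient: once the offset log has been rewritten under the new key, the mapping is never exercised again.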