Technically, there is no agreement here. In other words, we are in the same
situation as the initial discussion thread, where we couldn't build
community consensus on this.

> I will consider this as "lazy consensus" if there are no objections
> for 3 days from initiation of the thread.

If you need an explicit veto, here is mine: -1, because I don't think
it's just a string.

> the problematic config is just a "string",

To be clear, as I proposed in both the PR comments and the initial
discussion thread, I believe we had better keep `master` and `branch-4.0`
as-is and recommend upgrading to the latest version of Apache Spark 3.5.x
first before upgrading to Spark 4.
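
For readers following along, the migration logic under discussion (described
in the quoted thread below) boils down to re-mapping a key persisted in the
streaming offset log to its new name when the query restarts. Here is a
minimal sketch of that idea; the key names and function are placeholders for
illustration, not Spark's actual internals:

```python
# Hypothetical sketch: on restart, the conf map read back from the offset
# log may still contain the deprecated key, and its value is copied over
# to the new key. Both key names below are placeholders.
DEPRECATED_KEY = "spark.databricks.example.deprecatedConf"  # placeholder
NEW_KEY = "spark.sql.example.newConf"                       # placeholder

def migrate_offset_log_confs(confs: dict) -> dict:
    """Re-assign the value persisted under the deprecated key to the new key."""
    migrated = dict(confs)
    if DEPRECATED_KEY in migrated and NEW_KEY not in migrated:
        # The deprecated name survives only as a string used for lookup here;
        # it is never exposed as a user-settable config.
        migrated[NEW_KEY] = migrated.pop(DEPRECATED_KEY)
    return migrated

# A query that last ran on Spark 3.5.4 would have persisted the old key:
old_confs = {DEPRECATED_KEY: "true"}
print(migrate_offset_log_confs(old_confs))
```

Once a batch has been committed with the new key, the remapping is a no-op,
which matches the point below that the logic is only needed for the first
couple of microbatches after an upgrade.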

Sincerely,
Dongjoon.


On Tue, Mar 4, 2025 at 8:37 PM Jungtaek Lim <kabhwan.opensou...@gmail.com>
wrote:

> Bumping on this. Again, this is a blocker for Spark 4.0.0. I will consider
> this as "lazy consensus" if there are no objections for 3 days from
> initiation of the thread.
>
> On Tue, Mar 4, 2025 at 2:15 PM Jungtaek Lim <kabhwan.opensou...@gmail.com>
> wrote:
>
>> Hi dev,
>>
>> This is a spin-up of the original thread "Deprecating and banning
>> `spark.databricks.*` config from Apache Spark repository". (link
>> <https://lists.apache.org/thread/qwxb21g5xjl7xfp4rozqmg1g0ndfw2jd>)
>>
>> From the original thread, we decided to deprecate the config in Spark
>> 3.5.5 and remove the config in Spark 4.0.0. One thing that thread did not
>> decide is the smooth migration logic.
>>
>> We "persist" the config into the offset log for streaming queries, since
>> the value of the config must be consistent during the lifecycle of the
>> query. This means the problematic config is already persisted for any
>> streaming query that ever ran with Spark 3.5.4.
>>
>> For the migration logic, we re-assign the value of the problematic config
>> to the new config. This happens when the query is restarted, and it is
>> reflected in the offset log for newer batches, so after a couple of new
>> microbatches the migration logic is no longer needed. This migration logic
>> shipped in Spark 3.5.5, so once the query has run on Spark 3.5.5 for a
>> couple of microbatches, the issue is mitigated.
>>
>> But I would say there will always be cases where users bump the
>> minor/major version without going through every bugfix version. I think it
>> is still dangerous to remove the migration logic in Spark 4.0.0 (and
>> probably Spark 4.1.0, depending on the discussion). In the migration
>> logic, the problematic config is just a "string", and users wouldn't be
>> able to set a value under the problematic config name. We don't document
>> this, as it happens automatically.
>>
>> That said, I'd propose to keep the migration logic for the Spark 4.0
>> version line (at minimum; 4.1 is debatable). This will give users a safer
>> and less burdensome migration path, at the cost of just retaining a
>> problematic "string" (again, not a config).
>>
>> I'd love to hear the community's voice on this. As a reminder, this is a
>> blocker for Spark 4.0.0.
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>>
>
