Hi dev,

This is a spin-off of the original thread "Deprecating and banning
`spark.databricks.*` config from Apache Spark repository". (link
<https://lists.apache.org/thread/qwxb21g5xjl7xfp4rozqmg1g0ndfw2jd>)

From the original thread, we decided to deprecate the config in Spark 3.5.5
and remove it in Spark 4.0.0. One thing that thread did not settle is the
smooth-migration logic.

We "persist" the config into the offset log for streaming queries, since the
value of the config must stay consistent across the lifecycle of the query.
This means the problematic config is already persisted for any streaming
query that ever ran with Spark 3.5.4.
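To make the "pinning" behavior concrete, here is a minimal sketch (plain
Python dicts, not actual Spark internals): the values of certain session
configs are written into the offset log with the first batch, and every
later batch carries those pinned values forward regardless of what the
session says at that point. All names here are placeholders.

```python
def start_batch(offset_log: dict, batch_id: int,
                session_confs: dict, pinned_keys: list) -> dict:
    """Persist pinned confs at batch 0; later batches reuse the pinned values."""
    if batch_id == 0:
        # First batch: snapshot the relevant session confs into the log.
        metadata = {k: session_confs[k] for k in pinned_keys if k in session_confs}
    else:
        # Subsequent batches: carry forward the values pinned earlier,
        # ignoring any changed session value.
        metadata = offset_log[batch_id - 1]
    offset_log[batch_id] = metadata
    return metadata
```

This is why a query that ever ran on 3.5.4 has the problematic key baked
into its checkpoint: the snapshot outlives the session that set it.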

For the migration logic, we re-assign the value of the problematic config
to the new config. This happens when the query is restarted, and it is
reflected into the offset log for newer batches, so after a couple of new
microbatches the migration logic is no longer needed. This migration logic
ships in Spark 3.5.5, so once the query has run on Spark 3.5.5 for a couple
of microbatches, the issue is mitigated.
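The restart-time re-assignment above can be sketched as a simple key
remap when restoring metadata from the offset log. This is a hypothetical
illustration, not the actual Spark code; the config names are placeholders.

```python
# Placeholder key names, not the real configs under discussion.
OLD_KEY = "spark.databricks.example.conf"
NEW_KEY = "spark.sql.example.conf"

def restore_confs(persisted: dict) -> dict:
    """On restart, move the value stored under the old key to the new key.

    The old key survives here only as a lookup string; it is no longer a
    settable config. Newer batches then persist only the new key.
    """
    restored = dict(persisted)
    if OLD_KEY in restored:
        # Keep an already-present new value; otherwise adopt the old one.
        restored.setdefault(NEW_KEY, restored.pop(OLD_KEY))
        restored.pop(OLD_KEY, None)
    return restored
```

After a couple of microbatches the offset log contains only the new key,
at which point the remap becomes a no-op.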

But I would say there will always be cases where users bump the
minor/major version without going through every bugfix release. I think it
is still dangerous to remove the migration logic in Spark 4.0.0 (and
probably Spark 4.1.0, depending on the discussion). In the migration
logic, the problematic config is just a "string", and users would not be
able to set a value using the problematic config name. We don't document
this, as it happens automatically.

That said, I'd propose keeping the migration logic for the Spark 4.0
version line (at minimum; 4.1 is debatable). This gives users a safer,
lower-burden migration path at the cost of only retaining a problematic
"string" (again, not a config).

I'd love to hear the community's voice on this. As a reminder, this is a
blocker for Spark 4.0.0.

Thanks,
Jungtaek Lim (HeartSaVioR)
