Though if we are OK with disturbing users to read the migration guide to figure out the change for the case of direct upgrade to Spark 4.0.0+, I agree this is also one of the valid ways.
On Wed, Feb 19, 2025 at 9:20 AM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote: > The point is, why can't we remove it from Spark 3.5.5 as well if we are > planning to "remove" (not deprecate) at the very next minor release? > > The logic of migration just works without having the incorrect config key > to be indicated with SQL config key. > > That said, the point we debate here is only valid when we want to let > users keep setting the value of the config manually for some time. I'd > argue users would never set this manually, but let's put this aside for now. > > What is the proper "some time" here? If we deprecate this in Spark 3.5.5 > and remove it to Spark 4.0.0, we are going to break the case of setting the > config manually when they are upgrading Spark 3.5.4 to Spark 4.0.0 > directly. I have no idea how we construct the message to recommend > upgrading from Spark 3.5.4 to Spark 3.5.5, but if it's not strong enough, > there will always be a case who directly jumps major/minor versions, and > for them, effectively we do not deprecate the config very well. > > I suspect there would be a huge difference if we just remove it in Spark > 3.5.5 and only support the migration case. That has been something I > claimed if we really want to kick the incorrect config out ASAP. If we want > to do this gracefully, I don't feel like removing this in Spark 4.0.0 is > giving users enough time to indicate the config is deprecated and will be > removed very soon. > > On Wed, Feb 19, 2025 at 2:24 AM Holden Karau <holden.ka...@gmail.com> > wrote: > >> I think that removing in 4 sounds reasonable to me as well. It’s >> important to create a sense of fairness among vendors. >> >> Twitter: https://twitter.com/holdenkarau >> Fight Health Insurance: https://www.fighthealthinsurance.com/ >> <https://www.fighthealthinsurance.com/?q=hk_email> >> Books (Learning Spark, High Performance Spark, etc.): >> https://amzn.to/2MaRAG9 <https://amzn.to/2MaRAG9> >> YouTube Live Streams: https://www.youtube.com/user/holdenkarau >> Pronouns: she/her >> >> >> On Tue, Feb 18, 2025 at 11:22 AM Dongjoon Hyun <dongjoon.h...@gmail.com> >> wrote: >> >>> I don't think there is a reason to keep it at 4.0.0 (and forever?) if we >>> release Spark 3.5.5 with the proper deprecation. This is a big difference, >>> Wenchen. >>> >>> And, the difference is the main reason why I initiated this thread to >>> sugguest to remove 'spark.databricks.*' completely from Apache Spark 4 via >>> volunteering Spark 3.5.5 release manager. >>> >>> Sincerely, >>> Dongjoon >>> >>> >>> On Mon, Feb 17, 2025 at 22:59 Wenchen Fan <cloud0...@gmail.com> wrote: >>> >>>> It’s unfortunate that we missed identifying these issues during the >>>> code review. However, since they have already been released, I believe >>>> deprecating them is a better approach than removing them, as the latter >>>> would introduce a breaking change. >>>> >>>> Regarding Jungtaek’s PR <https://github.com/apache/spark/pull/49983>, >>>> it looks like there are only a few lines of migration code. Would it be >>>> acceptable to leave them for legacy support? With the new config name style >>>> check rule in place, such issues should not occur again in the future. >>>> >>>> On Tue, Feb 18, 2025 at 9:00 AM Jungtaek Lim < >>>> kabhwan.opensou...@gmail.com> wrote: >>>> >>>>> I think I can add a color to minimize the concern. >>>>> >>>>> The problematic config we added is arguably not user facing. I'd argue >>>>> moderate users wouldn't even understand what the flag is doing. The config >>>>> was added because Structured Streaming has been leveraging SQL config to >>>>> "do the magic" on having two different default values for new query vs old >>>>> query (checkpoint is created from the version where the fix is not >>>>> landed). >>>>> This is purely used for backward compatibility, not something we want to >>>>> give users flexibility. >>>>> >>>>> That said, I don't see a risk of removing config "at any point". (I'd >>>>> even say removing this config in Spark 3.5.5 does not change anything. The >>>>> reason I'm not removing the config in 3.5 (and yet to 4.0/master) is just >>>>> to address any concern on being conservative.) >>>>> >>>>> I think you are worrying about case 1 from my comment. From my new >>>>> change (link <https://github.com/apache/spark/pull/49983>), I made a >>>>> migration logic when the offset log contains the problematic configuration >>>>> - we will take the value, but put the value to the new config, and at the >>>>> next microbatch planning, the offset log will contain the new >>>>> configuration >>>>> going forward. This addresses the case 1, as long as we retain the >>>>> migration logic for a couple minor releases (say, 4.2 or so). We just need >>>>> to support this migration logic for the time where we never thought of >>>>> jumping directly from Spark 3.5.4 to the version. >>>>> >>>>> Hope this helps to address your concern/worrying. >>>>> >>>>> >>>>> On Tue, Feb 18, 2025 at 7:40 AM Bjørn Jørgensen < >>>>> bjornjorgen...@gmail.com> wrote: >>>>> >>>>>> >>>>>> Having breaking changes in a minor seems not that good.. As I'm >>>>>> reading this, >>>>>> >>>>>> "*This could break the query if the rule impacts the query, because >>>>>> the effectiveness of the fix is flipped.*" >>>>>> https://github.com/apache/spark/pull/49897#issuecomment-2652567140 >>>>>> >>>>>> >>>>>> What if we have this https://github.com/apache/spark/pull/48149 >>>>>> change in the branch and remove it only for version 4? That way we dont >>>>>> break anything. >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> man. 17. feb. 2025 kl. 23:03 skrev Dongjoon Hyun < >>>>>> dongjoon.h...@gmail.com>: >>>>>> >>>>>>> Hi, All. >>>>>>> >>>>>>> I'd like to highlight this discussion because this is more important >>>>>>> and tricky in a way. >>>>>>> >>>>>>> As already mentioned in the mailing list and PRs, there was an >>>>>>> obvious mistake >>>>>>> which missed an improper configuration name, `spark.databricks.*`. >>>>>>> >>>>>>> >>>>>>> https://github.com/apache/spark/blob/a6f220d951742f4074b37772485ee0ec7a774e7d/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L3424 >>>>>>> >>>>>>> `spark.databricks.sql.optimizer.pruneFiltersCanPruneStreamingSubplan` >>>>>>> >>>>>>> In fact, Apache Spark committers have been preventing this >>>>>>> repetitive mistake >>>>>>> pattern during the review stages successfully until we slip the >>>>>>> following backportings >>>>>>> at Apache Spark 3.5.4. >>>>>>> >>>>>>> https://github.com/apache/spark/pull/45649 >>>>>>> https://github.com/apache/spark/pull/48149 >>>>>>> https://github.com/apache/spark/pull/49121 >>>>>>> >>>>>>> At this point of writing, `spark.databricks.*` was removed >>>>>>> successfully from `master` >>>>>>> and `branch-4.0` and a new ScalaStyle rule was added to protect >>>>>>> Apache Spark repository >>>>>>> from future mistakes. >>>>>>> >>>>>>> SPARK-51172 Rename to >>>>>>> spark.sql.optimizer.pruneFiltersCanPruneStreamingSubplan >>>>>>> SPARK-51173 Add `configName` Scalastyle rule >>>>>>> >>>>>>> What I proposed is to release Apache Spark 3.5.5 next week with the >>>>>>> deprecation >>>>>>> in order to make Apache Spark 4.0 be free of `spark.databricks.*` >>>>>>> configuration. >>>>>>> >>>>>>> Apache Spark 3.5.5 (2025 February, with deprecation warning with >>>>>>> alternative) >>>>>>> Apache Spark 4.0.0 (2025 March, without `spark.databricks.*` config) >>>>>>> >>>>>>> In addition, I'd like to volunteer as a release manager of Apache >>>>>>> Spark 3.5.5 >>>>>>> for a swift release. WDYT? >>>>>>> >>>>>>> FYI, `branch-3.5` has 37 patches currently. >>>>>>> >>>>>>> $ git log --oneline v3.5.4..HEAD | wc -l >>>>>>> 37 >>>>>>> >>>>>>> Best Regards, >>>>>>> Dongjoon. >>>>>>> >>>>>> >>>>>> >>>>>> -- >>>>>> Bjørn Jørgensen >>>>>> Vestre Aspehaug 4, 6010 Ålesund >>>>>> <https://www.google.com/maps/search/Vestre+Aspehaug+4,+6010+%C3%85lesund++Norge?entry=gmail&source=g> >>>>>> Norge >>>>>> <https://www.google.com/maps/search/Vestre+Aspehaug+4,+6010+%C3%85lesund++Norge?entry=gmail&source=g> >>>>>> >>>>>> +47 480 94 297 >>>>>> >>>>>