Removing it in Spark 4 sounds reasonable to me as well. It’s important to create a sense of fairness among vendors.

Twitter: https://twitter.com/holdenkarau
Fight Health Insurance: https://www.fighthealthinsurance.com/ <https://www.fighthealthinsurance.com/?q=hk_email>
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
Pronouns: she/her

On Tue, Feb 18, 2025 at 11:22 AM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:

> I don't think there is a reason to keep it in 4.0.0 (and forever?) if we
> release Spark 3.5.5 with the proper deprecation. This is a big difference,
> Wenchen.
>
> And, the difference is the main reason why I initiated this thread to
> suggest removing 'spark.databricks.*' completely from Apache Spark 4 by
> volunteering as the Spark 3.5.5 release manager.
>
> Sincerely,
> Dongjoon
>
>
> On Mon, Feb 17, 2025 at 22:59 Wenchen Fan <cloud0...@gmail.com> wrote:
>
>> It’s unfortunate that we missed identifying these issues during the code
>> review. However, since they have already been released, I believe
>> deprecating them is a better approach than removing them, as the latter
>> would introduce a breaking change.
>>
>> Regarding Jungtaek’s PR <https://github.com/apache/spark/pull/49983>, it
>> looks like there are only a few lines of migration code. Would it be
>> acceptable to leave them for legacy support? With the new config name style
>> check rule in place, such issues should not occur again in the future.
>>
>> On Tue, Feb 18, 2025 at 9:00 AM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>>
>>> I think I can add some color to minimize the concern.
>>>
>>> The problematic config we added is arguably not user facing. I'd argue
>>> moderate users wouldn't even understand what the flag is doing. The config
>>> was added because Structured Streaming has been leveraging SQL config to
>>> "do the magic" on having two different default values for new query vs old
>>> query (checkpoint is created from the version where the fix is not landed).
>>> This is purely used for backward compatibility, not something we want to
>>> give users flexibility over.
>>>
>>> That said, I don't see a risk of removing config "at any point". (I'd
>>> even say removing this config in Spark 3.5.5 does not change anything. The
>>> reason I'm not removing the config in 3.5 (and yet to 4.0/master) is just
>>> to address any concern on being conservative.)
>>>
>>> I think you are worrying about case 1 from my comment. In my new
>>> change (link <https://github.com/apache/spark/pull/49983>), I added
>>> migration logic for when the offset log contains the problematic
>>> configuration - we will take the value, but put the value to the new
>>> config, and at the next microbatch planning, the offset log will contain
>>> the new configuration going forward. This addresses case 1, as long as we
>>> retain the migration logic for a couple of minor releases (say, 4.2 or so).
>>> We just need to support this migration logic for the time where we never
>>> thought of jumping directly from Spark 3.5.4 to the version.
>>>
>>> Hope this helps to address your concern.
>>>
>>>
>>> On Tue, Feb 18, 2025 at 7:40 AM Bjørn Jørgensen <bjornjorgen...@gmail.com> wrote:
>>>
>>>>
>>>> Having breaking changes in a minor release does not seem that good. As
>>>> I'm reading this,
>>>>
>>>> "*This could break the query if the rule impacts the query, because
>>>> the effectiveness of the fix is flipped.*"
>>>> https://github.com/apache/spark/pull/49897#issuecomment-2652567140
>>>>
>>>>
>>>> What if we have this https://github.com/apache/spark/pull/48149 change
>>>> in the branch and remove it only for version 4? That way we don't break
>>>> anything.
>>>>
>>>>
>>>>
>>>>
>>>> On Mon, Feb 17, 2025 at 23:03, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>>
>>>>> Hi, All.
>>>>>
>>>>> I'd like to highlight this discussion because this is more important
>>>>> and tricky in a way.
>>>>>
>>>>> As already mentioned in the mailing list and PRs, there was an obvious
>>>>> mistake in which an improper configuration name, `spark.databricks.*`,
>>>>> was missed.
>>>>>
>>>>> https://github.com/apache/spark/blob/a6f220d951742f4074b37772485ee0ec7a774e7d/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L3424
>>>>>
>>>>> `spark.databricks.sql.optimizer.pruneFiltersCanPruneStreamingSubplan`
>>>>>
>>>>> In fact, Apache Spark committers have been preventing this repetitive
>>>>> mistake pattern during the review stages successfully until the
>>>>> following backports slipped into Apache Spark 3.5.4.
>>>>>
>>>>> https://github.com/apache/spark/pull/45649
>>>>> https://github.com/apache/spark/pull/48149
>>>>> https://github.com/apache/spark/pull/49121
>>>>>
>>>>> At the time of writing, `spark.databricks.*` was removed successfully
>>>>> from `master` and `branch-4.0` and a new ScalaStyle rule was added to
>>>>> protect the Apache Spark repository from future mistakes.
>>>>>
>>>>> SPARK-51172 Rename to spark.sql.optimizer.pruneFiltersCanPruneStreamingSubplan
>>>>> SPARK-51173 Add `configName` Scalastyle rule
>>>>>
>>>>> What I proposed is to release Apache Spark 3.5.5 next week with the
>>>>> deprecation in order to make Apache Spark 4.0 free of
>>>>> `spark.databricks.*` configuration.
>>>>>
>>>>> Apache Spark 3.5.5 (2025 February, with deprecation warning with alternative)
>>>>> Apache Spark 4.0.0 (2025 March, without `spark.databricks.*` config)
>>>>>
>>>>> In addition, I'd like to volunteer as a release manager of Apache
>>>>> Spark 3.5.5 for a swift release. WDYT?
>>>>>
>>>>> FYI, `branch-3.5` has 37 patches currently.
>>>>>
>>>>> $ git log --oneline v3.5.4..HEAD | wc -l
>>>>> 37
>>>>>
>>>>> Best Regards,
>>>>> Dongjoon.
>>>>>
>>>>
>>>>
>>>> --
>>>> Bjørn Jørgensen
>>>> Vestre Aspehaug 4, 6010 Ålesund
>>>> Norge
>>>>
>>>> +47 480 94 297
>>>>
>>>
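
For readers following the thread, here is a minimal sketch of the offset-log migration that Jungtaek describes in the quoted message: if the stored configuration carries the old `spark.databricks.*` key, its value is moved to the new key so that later microbatches record only the new name. This is not the code from the linked PR. The function name `migrate_offset_log_conf`, the plain-dict representation of the offset log configuration, and the rule that the new key wins when both are present are all assumptions made for illustration; only the two config names come from the thread.

```python
# Config names taken from the thread (SPARK-51172).
OLD_KEY = "spark.databricks.sql.optimizer.pruneFiltersCanPruneStreamingSubplan"
NEW_KEY = "spark.sql.optimizer.pruneFiltersCanPruneStreamingSubplan"


def migrate_offset_log_conf(conf):
    """Hypothetical helper: return a copy of `conf` with OLD_KEY's value moved to NEW_KEY.

    `conf` stands in for the SQL configs recorded in a streaming offset log,
    modelled here as a plain dict of strings.
    """
    migrated = dict(conf)  # never mutate the caller's dict
    if OLD_KEY in migrated:
        old_value = migrated.pop(OLD_KEY)
        # Assumption: if both keys are present, the value under the new key wins.
        migrated.setdefault(NEW_KEY, old_value)
    return migrated


if __name__ == "__main__":
    checkpointed = {OLD_KEY: "false", "spark.sql.shuffle.partitions": "200"}
    print(migrate_offset_log_conf(checkpointed))
```

Applying the helper a second time leaves the result unchanged, which mirrors the point in the thread that once the new key is written, the old one is no longer needed.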