Hi,

I’m working on a performance issue that ends in an OutOfMemoryError when AQE is enabled. The problem was first identified by Russell Jurney while running the GraphFrames unit tests, as detailed in his gist <https://gist.github.com/rjurney/6abeffbd59c67df5e5243c8f6619b6bf>. It was also discussed in a related Spark mailing list thread <https://lists.apache.org/thread/kl50ryobwqlr93s6zwkhjp9rjsqkpwk0>. The problem has nothing to do with GraphFrames specifically; it arises because Spark internally generates a massive physical plan and converts it into a String (the same plan, several times) during execution with cached DataFrames and AQE enabled.
While I haven’t opened a Jira issue yet, I plan to do so shortly. Given its potential to affect many use cases, I believe it would be beneficial to address this issue in time for the Spark 4.0 release.

Regards,
Ángel

On Wed, 22 Jan 2025 at 23:17, Mich Talebzadeh (<mich.talebza...@gmail.com>) wrote:

> Interesting points: client server architecture has been around since the
> days of Sybase. A client written in any language, say Python, Scala makes a
> request to spark cluster. This remote access model inherently creates a
> level of isolation between the client application and the internal workings
> of the Spark cluster. So this brings in certain benefits. However, this
> client-server architecture introduces challenges when integrating features
> like SQL Scripting, which might require deeper integration with the Spark
> engine's internal mechanisms? Is that assertion correct and if so what are
> the challenges in a nutshell that you referred to?
>
> HTH,
>
> Mich Talebzadeh,
> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>
> view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
> On Wed, 22 Jan 2025 at 20:47, David Milicevic
> <david.milice...@databricks.com.invalid> wrote:
>
>> Hi all,
>>
>> Together with my team, I'm working on adding support for SQL Scripting (
>> JIRA <https://issues.apache.org/jira/browse/SPARK-48338>, Ref Spec
>> <https://docs.google.com/document/d/1uFv2VoqDoOH2k6HfgdkBYxp-Ou7Qi721eexNW6IYWwk/edit?pli=1&tab=t.0#heading=h.4cz970y1mk93>
>> ).
>> The feature is guarded by `spark.sql.scripting.enabled` SQL Conf because
>> it's still in development, but some features are already available in OSS
>> Spark - it's possible to execute scripts with all of the regular
>> statements, as well as with newly added control flow statements - IF/ELSE,
>> CASE, WHILE, REPEAT, LOOP, FOR, LEAVE, ITERATE, etc.
>> SQL Scripting still doesn't work with Spark Connect.
>>
>> Thanks,
>> David
>>
>> On Wed, Jan 22, 2025 at 12:25 PM Stefan Kandic
>> <stefan.kan...@databricks.com.invalid> wrote:
>>
>>> Hi,
>>>
>>> I am working on adding collation support (
>>> https://issues.apache.org/jira/projects/SPARK/issues/SPARK-46830).
>>>
>>> Right now, collations are enabled by default as we have finished almost
>>> everything we planned to add. However, there are still some smaller things
>>> and improvements left that have ongoing efforts (setting default collation
>>> on table/view/schema etc.)
>>>
>>> Regards,
>>> Stefan
>>>
>>> On 2025/01/15 13:41:07 Wenchen Fan wrote:
>>> > Hi all,
>>> >
>>> > We have cut the "branch-4.0" and I'm sending this email to collect the
>>> > information for ongoing projects targeting Spark 4.0. Please reply to this
>>> > email to share the project progress with the community.
>>> >
>>> > Note that, the scheduled code freeze date is Feb 1, and RC1 cut date is Feb
>>> > 15.
>>> >
>>> > Thanks,
>>> > Wenchen