Re: [DISCUSS] Ongoing projects for Spark 4.0

Ángel Wed, 22 Jan 2025 22:13:36 -0800

Hi,

I’m working on a performance issue that ends in an OutOfMemoryError when
Adaptive Query Execution (AQE) is enabled. The problem was first identified
by Russell Jurney while running the GraphFrames unit tests, as detailed in
his gist
<https://gist.github.com/rjurney/6abeffbd59c67df5e5243c8f6619b6bf>, and it
was also discussed in a related Spark mailing list thread
<https://lists.apache.org/thread/kl50ryobwqlr93s6zwkhjp9rjsqkpwk0>. It is
not specific to GraphFrames: it arises because Spark internally generates a
massive physical plan and converts that same plan into a String several
times during execution when DataFrames are cached and AQE is enabled.
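
To give a feel for why rendering a large plan more than once is costly,
here is a toy model in plain Python. It is only a sketch, not Spark code:
PlanNode, plan_after and chars_rendered are hypothetical names, and the
growth pattern (each iteration joining the previous result with itself) is
an assumption made for illustration, not something taken from the gist.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PlanNode:
    """Hypothetical stand-in for a plan node; not a Spark class."""
    name: str
    children: List["PlanNode"] = field(default_factory=list)

    def tree_string(self, depth: int = 0) -> str:
        """Render the subtree as indented text, one node per line."""
        lines = ["  " * depth + self.name]
        lines.extend(child.tree_string(depth + 1) for child in self.children)
        return "\n".join(lines)


def count_nodes(plan: PlanNode) -> int:
    """Count every node that a full rendering of the plan would visit."""
    return 1 + sum(count_nodes(child) for child in plan.children)


def plan_after(iterations: int) -> PlanNode:
    """Assumed growth pattern: each step joins the previous result with itself."""
    plan = PlanNode("Scan")
    for step in range(iterations):
        plan = PlanNode(f"Join#{step}", [plan, plan])
    return plan


def chars_rendered(plan: PlanNode, times: int) -> int:
    """Total characters produced when the same plan is rendered `times` times."""
    return sum(len(plan.tree_string()) for _ in range(times))
```

In this model the node count follows 2^(n+1) - 1, so plan_after(10) already
has 2047 nodes, and every extra rendering of the same plan allocates another
full-size string.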


While I haven’t opened a Jira issue yet, I plan to do so shortly. Given its
potential to affect many use cases, I believe it would be beneficial to
address this issue in time for the Spark 4.0 release.


Regards,

Ángel





On Wed, 22 Jan 2025 at 23:17, Mich Talebzadeh (<[email protected]>)
wrote:

> Interesting points: client-server architecture has been around since the
> days of Sybase. A client written in any language, say Python or Scala,
> makes a request to the Spark cluster. This remote access model inherently
> creates a level of isolation between the client application and the
> internal workings of the Spark cluster, which brings certain benefits.
> However, it also introduces challenges when integrating features like SQL
> Scripting, which might require deeper integration with the Spark engine's
> internal mechanisms. Is that assertion correct, and if so, what in a
> nutshell are the challenges you referred to?
>
> HTH,
>
> Mich Talebzadeh,
> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>
>    view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>
>
>
> On Wed, 22 Jan 2025 at 20:47, David Milicevic
> <[email protected]> wrote:
>
>> Hi all,
>>
>> Together with my team, I'm working on adding support for SQL Scripting (
>> JIRA <https://issues.apache.org/jira/browse/SPARK-48338>, Ref Spec
>> <https://docs.google.com/document/d/1uFv2VoqDoOH2k6HfgdkBYxp-Ou7Qi721eexNW6IYWwk/edit?pli=1&tab=t.0#heading=h.4cz970y1mk93>
>> ).
>> The feature is guarded by `spark.sql.scripting.enabled` SQL Conf because
>> it's still in development, but some features are already available in OSS
>> Spark - it's possible to execute scripts with all of the regular
>> statements, as well as with newly added control flow statements - IF/ELSE,
>> CASE, WHILE, REPEAT, LOOP, FOR, LEAVE, ITERATE, etc.
>> SQL Scripting still doesn't work with Spark Connect.
>>
>> Thanks,
>> David
>>
>> On Wed, Jan 22, 2025 at 12:25 PM Stefan Kandic
>> <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> I am working on adding collation support (
>>> https://issues.apache.org/jira/projects/SPARK/issues/SPARK-46830).
>>>
>>> Right now, collations are enabled by default as we have finished almost
>>> everything we planned to add. However, there are still some smaller things
>>> and improvements left that have ongoing efforts (setting default collation
>>> on table/view/schema etc.)
>>>
>>> Regards,
>>> Stefan
>>>
>>> On 2025/01/15 13:41:07 Wenchen Fan wrote:
>>> > Hi all,
>>> >
>>> > We have cut the "branch-4.0" and I'm sending this email to collect the
>>> > information for ongoing projects targeting Spark 4.0. Please reply to
>>> > this email to share the project progress with the community.
>>> >
>>> > Note that the scheduled code freeze date is Feb 1, and the RC1 cut
>>> > date is Feb 15.
>>> >
>>> > Thanks,
>>> > Wenchen
>>> >
>>>
>>
