[ 
https://issues.apache.org/jira/browse/SPARK-53903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sumit Kumar updated SPARK-53903:
--------------------------------
    Attachment: testcase.ipynb

> Performance degradation for PySpark apis at scale as compared to Scala apis
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-53903
>                 URL: https://issues.apache.org/jira/browse/SPARK-53903
>             Project: Spark
>          Issue Type: Brainstorming
>          Components: PySpark
>    Affects Versions: 3.5.1
>            Reporter: Sumit Kumar
>            Priority: Major
>         Attachments: testcase.ipynb
>
>
> Customers love PySpark and the flexibility of using various Python 
> libraries as part of their workflows. We have a scenario where this 
> specific use case involves multiple tables with around 10k columns, and 
> some of those columns have an array datatype that, when exploded, yields 
> ~1k columns each.
> *Issues that we are facing:*
>  * Frequent driver OOMs, depending on the use case, on how many columns 
> are involved in the logic, and on how many array-type columns are 
> exploded. There is frequent GC, slowing down the workflows.
>  * We tried the equivalent Scala APIs, and both performance and latency 
> were much better (no OOMs and significantly less GC overhead).
> *Here is what we understand so far from thread and memory dumps:*
>  * The driver ends up holding open references for every PySpark object 
> created in the Python VM, because of the Py4J-bridge-based communication 
> implementation behind the PySpark APIs, and this garbage keeps 
> accumulating on the driver, ultimately leading to OOM.
>  * Even if we delete the Python references in PySpark code (for example, 
> "del df_dummy1") and run "gc.collect()" explicitly, we are not able to 
> ease the memory pressure. Python GC, or Python-triggered GC on the driver 
> via the Py4J bridge, does not seem to be effective.
> This is not a typical workload, but we have multiple such use cases and 
> are debating whether it is worth migrating the existing workflows to 
> Scala (our existing DEs are more comfortable with PySpark, and there is 
> also a migration cost that we would have to convince our management to 
> approve).
> *ASK:*
> This Jira includes a sample notebook that reproduces what our use cases 
> see. We are seeking community feedback on such a use case, and ideas to 
> improve this situation other than migrating to the Scala APIs. Are there 
> any PySpark improvements that could help?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
