Sumit Kumar created SPARK-53903:
-----------------------------------

             Summary: Performance degradation for PySpark APIs at scale compared to Scala APIs
                 Key: SPARK-53903
                 URL: https://issues.apache.org/jira/browse/SPARK-53903
             Project: Spark
          Issue Type: Brainstorming
          Components: PySpark
    Affects Versions: 3.5.1
            Reporter: Sumit Kumar
         Attachments: testcase.ipynb

Customers love PySpark and the flexibility of using several Python libraries as part of our workflows. I have a somewhat unusual scenario: this particular use case involves multiple tables with around 10k columns each, and some of those columns are of array type and, when exploded, yield ~1k columns each.
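
For concreteness, the sketch below approximates the shape of the workload. All names and values are made up for illustration (the real tables are read from storage), but the column counts match what we see:

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-in for one of our tables: ~10k scalar columns plus an
# array-of-struct column whose elements carry ~1k fields each.
base = spark.range(100)
wide = base.select(
    "id",
    *[F.lit(0).alias(f"col_{i}") for i in range(10_000)],
    F.array(F.struct(*[F.lit(0).alias(f"f_{j}") for j in range(1_000)])).alias("items"),
)

# Exploding the array column multiplies the rows and surfaces the ~1k
# nested fields as top-level columns.
exploded = wide.withColumn("item", F.explode("items")).select("*", "item.*")
{code}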

*Issues that we are facing:*
 * Frequent driver OOMs, depending on the use case, how many columns are involved in the logic, and how many array-type columns are exploded. GC runs frequently, slowing the workflows down (see the GC-logging sketch after this list).
 * We tried the equivalent Scala APIs, and both performance and latency were much better (no OOMs and significantly less GC overhead).
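
For reference, this is roughly how the driver-side GC overhead can be quantified; the exact JVM flags depend on the JDK version, so treat this as a sketch rather than our exact configuration:

{code:python}
from pyspark.sql import SparkSession

# Sketch: enable driver GC logging to quantify the GC overhead. Note that
# spark.driver.extraJavaOptions only takes effect if set before the driver
# JVM launches (e.g. via spark-submit --conf); -Xlog:gc* is the JDK 11+
# form, on JDK 8 use -verbose:gc -XX:+PrintGCDetails instead.
spark = (
    SparkSession.builder
    .appName("wide-schema-gc-check")
    .config("spark.driver.extraJavaOptions", "-Xlog:gc*:file=/tmp/driver_gc.log")
    .getOrCreate()
)
{code}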

*Here is what we understand so far from thread and memory dumps:*
 * The driver ends up holding open references for every PySpark object created in the Python VM, because the PySpark APIs communicate with the JVM through the Py4J bridge. This garbage keeps accumulating on the driver, ultimately leading to OOM.
 * Even if we delete the Python references in PySpark code (for example: del df_dummy1) and explicitly run "gc.collect()", we are not able to ease the memory pressure. Python GC, and the driver-side GC it triggers via the Py4J bridge, does not seem effective here (see the sketch after this list).
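
The sketch below condenses that pattern from the attached notebook into a minimal form; the loop body and names (df_dummy1, the c_i columns) are illustrative:

{code:python}
import gc

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(10)  # tiny stand-in for one of the wide tables

# Each transformation creates a new JVM-side Dataset that the driver keeps
# alive as long as the Python wrapper holds its Py4J handle (df._jdf).
for i in range(1000):
    df_dummy1 = df.withColumn(f"c_{i}", F.lit(i))
    df_dummy1.count()

    # What we tried: drop the Python reference and force a collection.
    # gc.collect() frees the Python wrapper, and Py4J then signals the JVM
    # to drop its reference, but in our runs the driver heap kept growing.
    del df_dummy1
    gc.collect()
{code}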

This is not a typical workload, but we have multiple such use cases, and we are debating whether it is worth rewriting the existing workflows in Scala (our existing data engineers are more comfortable with PySpark, and there is a migration cost that we would have to convince our management to approve).

*ASK:*

This Jira includes a sample notebook (testcase.ipynb) that reproduces what our use cases see. We are seeking community feedback on this kind of workload, and on any ideas, other than migrating to the Scala APIs, that could improve the situation. Are there PySpark improvements that could help?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
