Sumit Kumar created SPARK-53903:
-----------------------------------
Summary: Performance degradation for PySpark apis at scale as
compared to Scala apis
Key: SPARK-53903
URL: https://issues.apache.org/jira/browse/SPARK-53903
Project: Spark
Issue Type: Brainstorming
Components: PySpark
Affects Versions: 3.5.1
Reporter: Sumit Kumar
Attachments: testcase.ipynb
Customers love PySpark and the flexibility of using several Python libraries as
part of our workflows. I have a unique scenario where this specific use case has
multiple tables with around 10k columns, and some of those columns have an array
datatype that, when exploded, contains ~1k columns each.
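To make the shape concrete, here is a hypothetical, scaled-down sketch of such a
workload (this is not the attached testcase.ipynb; n_cols, n_struct_fields, and
the row counts are illustrative stand-ins only):
{code:python}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Scaled-down stand-ins: the real tables have ~10k scalar columns and
# array columns whose elements carry ~1k fields each.
n_cols = 1000
n_struct_fields = 100

wide = spark.range(1000).select(
    "id",
    *[F.lit(i).alias(f"c{i}") for i in range(n_cols)],
    F.array(
        F.struct(*[F.lit(j).alias(f"f{j}") for j in range(n_struct_fields)])
    ).alias("arr"),
)

# Exploding the array fans each element's fields out into top-level
# columns, inflating the logical plan the driver has to hold.
exploded = wide.withColumn("elem", F.explode("arr")).select("*", "elem.*")
{code}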
*Issues that we are facing:*
* Frequent driver OOMs, depending on the use case, how many columns are
involved in the logic, and how many array-type columns are exploded. GC runs
frequently, slowing down the workflows.
* We tried the equivalent Scala APIs, and both performance and latency were
much better (no OOMs and significantly less GC overhead).
*Here is what we understand so far from thread and memory dumps:*
* The driver ends up holding an open reference for every PySpark object created
in the Python VM, because the PySpark APIs communicate with the JVM through the
py4j bridge. This garbage keeps accumulating on the driver, ultimately leading
to OOM.
* Even if we delete the Python references in PySpark code (for example: del
df_dummy1) and run "gc.collect()" explicitly, we are unable to ease the memory
pressure. Python GC, and the JVM-side release it triggers over the py4j bridge,
does not seem to reclaim these driver-side objects effectively (see the sketch
after this list).
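For reference, below is a minimal sketch of the kind of cleanup we attempted.
It is illustrative only: df_dummy2 is a made-up name, and _jdf and _gateway are
PySpark/py4j internals, not supported APIs.
{code:python}
import gc

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df_dummy1 = spark.range(10).select(
    *[F.lit(i).alias(f"c{i}") for i in range(100)]
)

# Attempt 1: drop the Python reference and force a collection. When the
# py4j JavaObject behind the DataFrame is collected, py4j queues a
# release message to the JVM, but in our runs this did not ease driver
# memory pressure.
del df_dummy1
gc.collect()

# Attempt 2: explicitly detach the JVM-side object through the py4j
# gateway before dropping the Python reference (_jdf and _gateway are
# internal, unsupported attributes).
df_dummy2 = spark.range(10).select(
    *[F.lit(i).alias(f"c{i}") for i in range(100)]
)
spark.sparkContext._gateway.detach(df_dummy2._jdf)
del df_dummy2
gc.collect()
{code}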
This is not a typical workload, but we have multiple such use cases, and we are
debating whether it is worth rewriting the existing workflows in Scala just for
this (our existing DEs are more comfortable with PySpark, and there is also a
migration cost that we would have to convince our management to approve).
*ASK:*
This Jira includes a sample notebook that reproduces what our use cases see. We
are seeking community feedback on this kind of use case, and on ideas to
improve the situation other than migrating to the Scala APIs. Are there any
PySpark improvements that could help?