Xinrong Meng created SPARK-40281:
------------------------------------
Summary: Memory Profiler on Executors
Key: SPARK-40281
URL: https://issues.apache.org/jira/browse/SPARK-40281
Project: Spark
Issue Type: Umbrella
Components: PySpark
Affects Versions: 3.4.0
Reporter: Xinrong Meng
Profiling is critical to performance engineering. Memory consumption is a key
indicator of how efficient a PySpark program is. There is an existing effort on
memory profiling of Python programs, Memory Profiler
(https://pypi.org/project/memory-profiler/).
PySpark applications run as independent sets of processes on a cluster,
coordinated by the SparkContext object in the driver program. On the driver
side, PySpark is a regular Python process, thus, we can profile it as a normal
Python program using Memory Profiler.
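As a rough illustration of driver-side profiling as "a normal Python program", here is a minimal sketch using the standard-library tracemalloc module (a stand-in for Memory Profiler, which reports line-by-line increments); the function and sizes below are made up for demonstration:

```python
# Illustrative only: measure the memory footprint of a plain Python function,
# analogous to profiling driver-side PySpark code as a regular Python process.
import tracemalloc

def build_rows(n):
    # Stand-in for driver-side data preparation (hypothetical workload).
    return [(i, str(i)) for i in range(n)]

tracemalloc.start()
rows = build_rows(100_000)
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"current={current} bytes, peak={peak} bytes")
```

Memory Profiler additionally attributes increments to individual source lines via its @profile decorator, which is what makes it attractive for PySpark driver code.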
However, on the executor side, such a memory profiler is missing. Since
executors are distributed across different nodes in the cluster, we need to
aggregate their profiles. Furthermore, Python worker processes are spawned per
executor for Python/Pandas UDF execution, which makes the memory profiling
more intricate.
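To make the aggregation step concrete, here is a hypothetical sketch: suppose each distributed Python worker reports a mapping from UDF source line number to the memory increment (in MiB) it observed; a driver-side reducer could then sum increments per line. The names and data shape here are assumptions for illustration, not the actual SPARK-40281 design:

```python
# Hypothetical: merge per-worker line-level memory profiles on the driver.
from collections import defaultdict

def aggregate_profiles(worker_profiles):
    """Sum memory increments (MiB) per source line across all workers."""
    merged = defaultdict(float)
    for profile in worker_profiles:
        for line_no, mib in profile.items():
            merged[line_no] += mib
    return dict(merged)

# Two workers executing the same UDF report per-line increments.
worker_a = {10: 1.5, 11: 0.25}
worker_b = {10: 2.0, 12: 0.5}
print(aggregate_profiles([worker_a, worker_b]))
```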
The umbrella proposes to implement a Memory Profiler on Executors.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]