Josh Rosen created SPARK-3358:
---------------------------------
Summary: PySpark worker fork()ing performance regression in m3.* / PVM instances
Key: SPARK-3358
URL: https://issues.apache.org/jira/browse/SPARK-3358
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 1.1.0
Environment: m3.* instances on EC2
Reporter: Josh Rosen
SPARK-2764 (and some follow-up commits) simplified PySpark's worker process
structure by removing an intermediate pool of processes forked by daemon.py.
Previously, daemon.py forked a fixed-size pool of processes that shared a
socket and handled worker launch requests from Java. After my patch, this
intermediate pool was removed and launch requests are handled directly in
daemon.py, with a fork() on each launch.
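For context, here's a rough sketch of the two process models (illustrative only, not the actual daemon.py code; handle_worker_launch and the listener argument are placeholders):
{code}
import os

def handle_worker_launch(conn):
    # Placeholder: launch/serve a PySpark worker over the accepted connection.
    pass

def serve_with_prefork_pool(listener, pool_size=8):
    # Old model: fork a fixed-size pool up front. Each child accept()s on the
    # shared listening socket, so steady-state launch requests need no fork().
    for _ in range(pool_size):
        if os.fork() == 0:  # child
            while True:
                conn, _ = listener.accept()
                handle_worker_launch(conn)
                conn.close()
    # The parent would sit here monitoring and reaping the pool.

def serve_with_fork_per_request(listener):
    # New model: one daemon loop that fork()s per launch request, so every
    # task launch pays for a fork() system call.
    while True:
        conn, _ = listener.accept()
        if os.fork() == 0:  # child handles exactly one launch request
            handle_worker_launch(conn)
            conn.close()
            os._exit(0)
        conn.close()  # parent closes its copy and goes back to accept()
{code}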
Unfortunately, this change seems to have increased PySpark task launch latency when
running on m3.* class instances in EC2. Most of this difference can be
attributed to m3 instances' more expensive fork() system calls. I tried the
following microbenchmark on m3.xlarge and r3.xlarge instances:
{code}
import os

# Fork 1000 children that exit immediately; the parent does no other work.
for x in range(1000):
    if os.fork() == 0:
        exit()
{code}
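The numbers below are the shell's time output for that loop, i.e. something like (the script name is just a placeholder):
{code}
$ time python fork_bench.py
{code}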
On the r3.xlarge instance:
{code}
real 0m0.761s
user 0m0.008s
sys 0m0.144s
{code}
And on m3.xlarge:
{code}
real 0m1.699s
user 0m0.012s
sys 0m1.008s
{code}
I think this is due to the two instance types using different virtualization
technologies: r3 instances run under HVM, while the m3 instance here was
paravirtualized (PVM), and fork() appears to be much more expensive under PVM.
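To pin the cost on fork() itself, a per-fork latency measurement along these lines could be used (my own sketch, not part of any patch; unlike the loop above it reaps each child before forking the next one):
{code}
import os
import time

N = 1000
start = time.time()
for _ in range(N):
    pid = os.fork()
    if pid == 0:
        os._exit(0)        # child exits immediately, skipping cleanup handlers
    os.waitpid(pid, 0)     # parent reaps the child before the next fork()
elapsed = time.time() - start
print("avg fork/exit/wait latency: %.3f ms" % (elapsed / N * 1000))
{code}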
It may be the case that this performance difference only appears in certain
microbenchmarks and is masked by other performance improvements in PySpark,
such as improvements to large group-bys. I'm in the process of re-running
spark-perf benchmarks on m3 instances in order to confirm whether this impacts
more realistic jobs.