Josh Rosen created SPARK-3358:
---------------------------------

             Summary: PySpark worker fork()ing performance regression in m3.* / PVM instances
                 Key: SPARK-3358
                 URL: https://issues.apache.org/jira/browse/SPARK-3358
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 1.1.0
         Environment: m3.* instances on EC2
            Reporter: Josh Rosen


SPARK-2764 (and some follow-up commits) simplified PySpark's worker process 
structure by removing the intermediate pool of processes forked by daemon.py.  
Previously, daemon.py forked a fixed-size pool of processes that shared a 
socket and handled worker launch requests from Java.  After my patch, that 
intermediate pool is gone and daemon.py handles each launch request directly, 
fork()ing a fresh worker process per request.
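
Roughly speaking, the two process structures look like this (a simplified 
sketch, not the actual daemon.py code; handle_worker, prefork_pool, and 
fork_per_request are illustrative names only):

{code}
import os
import socket

def handle_worker(conn):
    # Stand-in for the real PySpark worker loop.
    conn.sendall(b"ready\n")
    conn.close()

def prefork_pool(listener, pool_size=8):
    # Old structure (pre-SPARK-2764), simplified: fork a fixed pool up front.
    # Each child blocks in accept() on the shared listening socket, so a
    # worker launch request never pays for a fork() on the hot path.
    for _ in range(pool_size):
        if os.fork() == 0:
            while True:
                conn, _ = listener.accept()
                handle_worker(conn)

def fork_per_request(listener):
    # New structure (post-SPARK-2764), simplified: daemon.py accepts each
    # launch request itself and fork()s a fresh worker for it, so every
    # launch pays the full fork() cost.  (Child reaping omitted here.)
    while True:
        conn, _ = listener.accept()
        if os.fork() == 0:
            handle_worker(conn)
            os._exit(0)
        conn.close()
{code}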

Unfortunately, this seems to have increased PySpark task launch latency when 
running on m3.* class instances in EC2.  Most of this difference can be 
attributed to m3 instances' more expensive fork() system calls.  I tried the 
following microbenchmark on m3.xlarge and r3.xlarge instances:

{code}
import os

# Fork 1000 child processes; each child exits immediately, so the loop
# measures little more than the raw fork() cost.
for x in range(1000):
  if os.fork() == 0:
    exit()
{code}

On the r3.xlarge instance:

{code}
real    0m0.761s
user    0m0.008s
sys     0m0.144s
{code}

And on m3.xlarge:

{code}
real    0m1.699s
user    0m0.012s
sys     0m1.008s
{code}
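
(The timings above are from running the script under the shell's time 
command.)  Here's a self-timing variant of the same loop, in case that's 
easier to reproduce (again just a sketch; this version also reaps the 
children so they don't linger as zombies):

{code}
import os
import time

N = 1000

start = time.time()
pids = []
for _ in range(N):
    pid = os.fork()
    if pid == 0:
        os._exit(0)      # child exits immediately; we only care about fork() cost
    pids.append(pid)

for pid in pids:
    os.waitpid(pid, 0)   # reap children so they don't accumulate as zombies

elapsed = time.time() - start
print("%d forks in %.3fs (%.3f ms per fork)" % (N, elapsed, 1000.0 * elapsed / N))
{code}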

I think this is due to the difference in virtualization technology: the r3 
instances are HVM while these m3 instances are PVM, and fork() appears to be 
significantly more expensive under PVM.

It may be that this performance difference only shows up in microbenchmarks 
and is masked by other PySpark performance improvements, such as the 
improvements to large group-bys.  I'm re-running the spark-perf benchmarks on 
m3 instances to confirm whether this impacts more realistic jobs.


