A recent SPIP
<https://docs.google.com/document/d/1Nphejrf_vh4YRECn0JPgKClqxDS_lB6wufZFJQxyY98/>
proposed improving Spark’s performance on small and local datasets. On that
SPIP I raised a related issue
<https://docs.google.com/document/d/1Nphejrf_vh4YRECn0JPgKClqxDS_lB6wufZFJQxyY98/edit?disco=AAAB5rOuVBw>
that I would like to surface here: the time it takes to create a Spark
session locally.

import time
from pyspark.sql import SparkSession

start = time.perf_counter()
session = (
    SparkSession.builder.remote("local[*]")
    .getOrCreate()
)
elapsed = time.perf_counter() - start
print(f"SparkSession startup: {elapsed:.3f}s")

On my M2 MacBook this consistently takes ~3 seconds.

If you’re working on an application that uses Spark and have a local dev/test 
loop set up, every iteration of that loop incurs this startup cost. This makes 
the entire experience feel sluggish.

A straightforward solution is to start a persistent Connect server using 
sbin/start-connect-server.sh and set your remote to sc://localhost:{port}. In 
my testing, this cuts the startup time from ~3 seconds to <1 second.
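
As a rough sketch of that workaround, the helper below prefers an 
already-running Connect server and falls back to local mode otherwise. The 
names connect_server_is_up and pick_remote are hypothetical, and I am assuming 
the server listens on Spark Connect's default port, 15002.

```python
import socket

# Assumption: the Connect server uses Spark Connect's default port.
DEFAULT_CONNECT_PORT = 15002


def connect_server_is_up(host: str = "localhost",
                         port: int = DEFAULT_CONNECT_PORT,
                         timeout: float = 0.2) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def pick_remote(host: str = "localhost",
                port: int = DEFAULT_CONNECT_PORT) -> str:
    """Prefer a running Connect server; otherwise fall back to local mode."""
    if connect_server_is_up(host, port):
        return f"sc://{host}:{port}"
    return "local[*]"
```

The result can be passed straight to the builder, as in 
SparkSession.builder.remote(pick_remote()).getOrCreate(). Note that the check 
only confirms that the port is open, not that a Connect server is the thing 
listening on it.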

This is good, but as a solution it has some problems:

- It’s not discoverable. Users are unlikely to figure this out by themselves.
- It’s not the default behavior. Users ideally should not have anything to 
  figure out at all. It should just work like this in the background.
- A background process needs some tooling to help manage.

I think we can address these problems by doing something like this:

- Make .remote("local") create a persistent Connect server in the background 
  by default, and restart/reconnect to it as needed.
- Add a basic CLI to manage the Connect server, like 
  spark connect {start | stop | show}. This CLI can perhaps just be a wrapper 
  around the scripts in sbin/.

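
To make the CLI idea concrete, here is a minimal sketch of what such a wrapper 
might look like. Everything in it is hypothetical: the function names, the 
meaning of "show" (here just a check of whether the default port 15002 is 
open), and the assumption that SPARK_HOME is set. It only delegates to the 
existing sbin/start-connect-server.sh and sbin/stop-connect-server.sh scripts 
and omits any extra arguments a particular Spark version may need.

```python
import argparse
import os
import socket
import subprocess

DEFAULT_PORT = 15002  # Spark Connect's default port

# Subcommands that map directly onto existing scripts under $SPARK_HOME/sbin.
SCRIPTS = {
    "start": "start-connect-server.sh",
    "stop": "stop-connect-server.sh",
}


def build_command(action: str, spark_home: str) -> list[str]:
    """Return the sbin script invocation for 'start' or 'stop'."""
    return [f"{spark_home.rstrip('/')}/sbin/{SCRIPTS[action]}"]


def server_status(host: str = "localhost", port: int = DEFAULT_PORT) -> str:
    """Report whether anything is listening on the Connect port."""
    try:
        with socket.create_connection((host, port), timeout=0.2):
            return f"running at sc://{host}:{port}"
    except OSError:
        return "not running"


def main(argv=None) -> int:
    parser = argparse.ArgumentParser(prog="spark connect")
    parser.add_argument("action", choices=["start", "stop", "show"])
    args = parser.parse_args(argv)
    if args.action == "show":
        print(server_status())
        return 0
    # Assumption: SPARK_HOME points at the Spark installation.
    spark_home = os.environ["SPARK_HOME"]
    return subprocess.call(build_command(args.action, spark_home))
```

Calling main(["show"]) prints the status line, while main(["start"]) and 
main(["stop"]) run the corresponding sbin script and return its exit code.
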
There are some details to figure out related to server idle timeouts, server 
discoverability, etc. But before exploring this further with a prototype, I 
wanted to get a reaction from the list.

What do you think?

Nick
